Abstract: This paper studies the problem of accurately recovering
a sparse vector $\beta^*$ from highly corrupted linear measurements
$y = X\beta^* + e^* + w$, where $e^*$ is a sparse error vector
whose nonzero entries may be unbounded and $w$ is bounded
noise. We propose a so-called extended Lasso optimization which
takes into account sparse prior information on both $\beta^*$ and
$e^*$. Our first result shows that the extended Lasso can faithfully
recover both the regression vector and the corruption vector. Our
analysis relies on the notion of an extended restricted eigenvalue
for the design matrix $X$. Our second set of results applies to
a general class of Gaussian design matrices $X$ with i.i.d. rows
$\mathcal{N}(0, \Sigma)$, for which we establish a surprising result: the
extended Lasso can recover the exact signed supports of both $\beta^*$
and $e^*$ from only $\Omega(k \log p \log n)$ observations, even when the
fraction of corruption is arbitrarily close to one. Our analysis
also shows that this number of observations required to achieve
exact signed support recovery is indeed optimal.

I. Introduction 

One of the central problems in statistics is linear regression,
in which the goal is to accurately estimate
the regression vector $\beta^* \in \mathbb{R}^p$ from the noisy observations

$y = X\beta^* + w$, (1)

where $X \in \mathbb{R}^{n \times p}$ is the measurement or design matrix, and
$w \in \mathbb{R}^n$ is the stochastic observation noise vector. A
situation that has recently attracted much attention from the research
community concerns the model in which the number of
regression variables $p$ is larger than the number of observations
$n$ ($p > n$). In such circumstances, without imposing
additional assumptions on the model, the
problem is ill-posed, and thus the linear regression is not
consistent. Accordingly, there have been various lines of work
on high-dimensional inference based on imposing different
types of structure constraints such as sparsity and group
sparsity (e.g. 0, 0, 0, 0, 0, 0, Q, 0, 0, QD|, 
BED, CD, Q3J, (111, |Q3]). Among them, the most popular 
model focuses on the sparsity assumption of the regression vector.
To estimate $\beta^*$, a standard method, namely the Lasso 0, was
proposed, which uses the $\ell_1$-penalty as a surrogate function to
enforce the sparsity constraint:

$\min_{\beta} \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$, (2)

where $\lambda$ is the positive regularization parameter and the
$\ell_1$-norm of the regression vector is defined as
$\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$.

Within the past few years, there have been numerous studies
to understand the $\ell_1$-regularization aspect of sparse regression
models (e.g. 0, 0, Q, 0, 0, (TO), ifTTV ). These works 
are mainly characterized by the type of the loss functions 
considered. For instance, authors 0, ifTTl seek to obtain a 
regression estimate $\hat{\beta}$ that delivers small prediction error, while
others [10], [6], [11] seek to produce a regressor with minimal
parameter estimation error, which is measured by the $\ell_2$-norm
of $\hat{\beta} - \beta^*$. Another line of work (e.g. [16], 0, 0) considers
variable selection, in which the goal is to obtain an estimate
that correctly identifies the support of the true regression
vector. To achieve low prediction or parameter estimation loss,
it is now well known that it is both sufficient and necessary to
impose certain lower bounds on the smallest singular values
of the design matrix (e.g. Q, [10]), while the notion of small
mutual coherence for the design matrix (e.g. 0, 0, 0) is
required to achieve accurate variable selection.

We notice that all previous work relies on the assumption
that the observation noise has bounded energy. Without this
assumption, it is very likely that the estimated regressor is
either unreliable or fails to identify the correct support.
With this observation in mind, in this paper we extend the
linear model (1) by considering noise with unbounded
energy. Clearly, if all entries of $y$ are corrupted by large
errors, then it is impossible to faithfully recover the regression
vector $\beta^*$. However, in many practical applications such as
face recognition, acoustic recognition and dense sensor
networks, only a portion of the observation vector is contaminated
by gross errors. Formally, we have the mathematical model

$y = X\beta^* + e^* + w$, (3)

where $e^* \in \mathbb{R}^n$ is the sparse error whose nonzero entries
have unknown locations and magnitudes that can be arbitrarily
large, whereas $w$ is the conventional noise vector with bounded
entries. In this paper, we assume that $w$ has a multivariate
Gaussian $\mathcal{N}(0, \sigma^2 I_{n \times n})$ distribution. This model also
includes as a special case the missing data problem, in which not
all entries of $y$ are observed and some are missing.
This problem is particularly important in computer vision and
biology applications. If some entries of $y$ are missing, the
nonzero entries of $e^*$ are located at the
missing entries of the observation vector $y$ and have the same
values as those entries of $y$ but with reverse polarity.
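
The missing-data reduction can be illustrated numerically. The following is a minimal sketch, assuming that missing entries of the observation vector are stored as zeros; the helper name `zero_fill_as_sparse_error` and the toy sizes are hypothetical and do not come from the paper.

```python
import numpy as np

def zero_fill_as_sparse_error(y_full, missing):
    """Return zero-filled observations and the implied sparse error e*.

    y_full  : uncorrupted observations X beta* + w (hypothetical input)
    missing : indices of the entries that are not observed
    """
    y_obs = y_full.copy()
    y_obs[missing] = 0.0      # assumption: missing entries are stored as zeros
    e_star = y_obs - y_full   # so that y_obs = y_full + e_star, as in model (3)
    return y_obs, e_star

rng = np.random.default_rng(0)
y_full = rng.standard_normal(8)
y_obs, e_star = zero_fill_as_sparse_error(y_full, missing=[1, 5])
# e_star vanishes off the missing set and equals -y_full on it
```

Under this convention the error vector is exactly as sparse as the set of missing entries, which is what lets the missing data problem be treated as a special case of model (3).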

The problem of faithfully recovering data under gross errors
has gained increasing attention recently, with many interesting
practical applications (e.g. [17], [18], [19]) as well as
theoretical considerations (e.g. [20], [21], [22], [23]). Another recent
line of research on recovering data from grossly corrupted
measurements has been studied in the context of robust
principal component analysis (RPCA) (e.g. flU, E3, EH).
Let us consider several examples as illustrations.

• Face recognition. The model (3) has been proposed by
Wright et al. [17] in the context of face recognition.
In this problem, a face test sample $y$ is assumed to
be represented as a linear combination of training faces
in the dictionary $X$. Hence, $y = X\beta$, where $\beta$ is the
coefficient vector used for classification. However, it is
often the case that the test face of interest is occluded
by unwanted objects such as glasses, hats, scarves, etc.
These occlusions, which occupy a portion of the test face,
can be considered as the sparse error $e^*$ in the model (3).

• Subspace clustering. An important problem in high-dimensional
data analysis is to cluster the data points into
multiple subspaces. A recent work of Elhamifar and Vidal
[18] shows that this problem can be solved by expressing
each data point as a sparse linear combination of all other
data points. Coefficient vectors recovered from solving
the Lasso problems are then employed for clustering.
If the data points are represented as a matrix $X$, then
we wish to find a sparse coefficient matrix $B$ such that
$X = XB$ and $\mathrm{diag}(B) = 0$. When the data is missing
or contaminated by outliers, the authors formulate the
problem as $X = XB + E$ and minimize a sum of two
$\ell_1$-norms with respect to both $B$ and $E$ [18].

• Sparse graphical model estimation. Given a random
vector $x \in \mathbb{R}^p$ with unknown covariance matrix $\Sigma$, the goal
is to estimate $\Sigma$ or its precision matrix $\Omega = \Sigma^{-1}$ from $n$
independent copies of $x$: $x_1, \ldots, x_n \in \mathbb{R}^p$. Assuming that
the matrix $\Omega$ is sparse, Meinshausen and Bühlmann [7]
propose to solve the following Lasso problem

$\min_{B} \frac{1}{2n} \|X - XB\|_F^2 + \lambda \|B\|_1$ s.t. $\mathrm{diag}(B) = 0$,

where $X = [x_1, \ldots, x_n]^T$. The precision matrix $\Omega$ can be
estimated via the coefficient matrix $B$. When the data $X$
is partially observed or missing, a more robust method is to
take into account the sparsity assumption and minimize

$\min_{B,E} \frac{1}{2n} \|X - XB - E\|_F^2 + \lambda_b \|B\|_1 + \lambda_e \|E\|_1$

subject to $\mathrm{diag}(B) = 0$, where $E$ represents partially
missing information. Though this problem is quite different
from the aforementioned subspace clustering problem,
the technical approach is quite similar.

• Sensor network. In this model, a network of sensors
collects measurements of a signal $\beta^*$ independently by
simply projecting $\beta^*$ onto the row vectors of a sensing
matrix $X$, $y_i = \langle X_i, \beta^* \rangle$ [27]. The measurements $y_i$ are
then sent to the central hub for analysis. However, it is
highly likely that a small percentage of sensors might
fail to send the measurements correctly and sometimes
even report totally irrelevant measurements. Therefore, it
is more appropriate to employ the observation model in
(3) than the model in (1).



It is worth noticing that in the aforementioned applications,
$e^*$ always plays the role of the sparse (undesired) error.
However, in other applications, $e^*$ might actually contain
meaningful information, and thus needs to be recovered.
An example of this kind of problem is signal separation,
in which $\beta^*$ and $e^*$ are considered as two distinct signal
components (e.g. video or audio). Furthermore, in applications
such as classification and clustering, the assumption that the
test sample $y$ is a linear combination of a few training samples
in the dictionary (playing the role of the design matrix) $X$
might be violated. The sparse component $e^*$ can thus be seen
as compensation for the linear regression model mismatch.

Given the observation model (3) and the sparsity assumptions
on both the regression vector $\beta^*$ and the error $e^*$, we propose
the following convex minimization to estimate the unknown
regression vector $\beta^*$ as well as the error vector $e^*$:

$\min_{\beta, e} \frac{1}{2n} \|y - X\beta - e\|_2^2 + \lambda_{n,\beta} \|\beta\|_1 + \lambda_{n,e} \|e\|_1$, (4)

where $\lambda_{n,\beta}$ and $\lambda_{n,e}$ are positive regularization parameters.
This optimization, which we call extended Lasso, can be seen
as a generalization of the Lasso program. Indeed, letting
$\lambda_{n,e} \to \infty$ forces $e = 0$, and (4) reduces to the standard Lasso. The additional
regularization associated with the error $e$ encourages sparsity
of the reconstructed error vector, where the penalty parameter $\lambda_{n,e}$
controls its sparsity level. In this paper, we focus on the
following questions: what are necessary and sufficient conditions
on the ambient dimension $p$, the number of observations $n$,
the sparsity index $k$ of the regression vector $\beta^*$ and the fraction of
corruption in $e^*$ so that (i) the extended Lasso is able (or
unable) to recover the exact support sets of both $\beta^*$ and $e^*$?
(ii) the extended Lasso is able to recover $\beta^*$ and $e^*$ with
small prediction error and parameter error? We are particularly
interested in understanding the asymptotic situation where
the fraction of error gets arbitrarily close to 100%.

In this paper, we assume normalization of the design matrix
$X$. Specifically, we assume that the $\ell_2$-norms of the columns of the
matrix $X$ are $\Theta(\sqrt{n})$. Moreover, without loss of generality,
we use the following observation model in place of
the model in (3):

$y = X\beta^* + \sqrt{n} e^* + w$. (5)

As we can see, the columns of both the design matrix $X$ and
the matrix $\sqrt{n} I_{n \times n}$ have the same scale. This change of
model only serves to make our results in the next sections more
interpretable. The optimization (4) is now converted to the
following problem

$\min_{\beta, e} \frac{1}{2n} \|y - X\beta - \sqrt{n} e\|_2^2 + \lambda_{n,\beta} \|\beta\|_1 + \lambda_{n,e} \|e\|_1$. (6)
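
Problem (6) is a Lasso in the stacked variable $(\beta, e)$ with design $[X, \sqrt{n} I]$, so it can be solved with any standard Lasso algorithm. The following is a minimal sketch using plain proximal gradient (ISTA) iterations; the solver choice, the function names and the iteration count are assumptions made for illustration and are not the method used in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, y, beta, e, lam_beta, lam_e):
    """Objective of (6): (1/2n)||y - X beta - sqrt(n) e||_2^2 plus penalties."""
    n = X.shape[0]
    r = y - X @ beta - np.sqrt(n) * e
    return r @ r / (2 * n) + lam_beta * np.abs(beta).sum() + lam_e * np.abs(e).sum()

def extended_lasso(X, y, lam_beta, lam_e, n_iter=2000):
    """ISTA iterations for (6), started from (beta, e) = (0, 0)."""
    n, p = X.shape
    sqrt_n = np.sqrt(n)
    # With Z = [X, sqrt(n) I] we have Z Z^T = X X^T + n I, so the gradient of
    # the quadratic term is Lipschitz with constant (||X||^2 + n) / n.
    step = n / (np.linalg.norm(X, 2) ** 2 + n)
    beta, e = np.zeros(p), np.zeros(n)
    for _ in range(n_iter):
        r = y - X @ beta - sqrt_n * e  # residual at the current iterate
        beta = soft_threshold(beta + step * (X.T @ r) / n, step * lam_beta)
        e = soft_threshold(e + step * r / sqrt_n, step * lam_e)
    return beta, e
```

Calling `extended_lasso(X, y, lam_beta, lam_e)` returns estimates of the regression vector and of the rescaled error vector in model (5). With the step size above, each iteration does not increase the objective; an accelerated or coordinate-descent solver would typically converge faster.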

Previous work. The problem of recovering the regression
vector $\beta^*$ and the error $e^*$ was originally proposed by Wright et al.
in the appealing paper [17] and analyzed by Wright and Ma
[20]. In the absence of the stochastic noise $w$ in the observation
model (3), the authors propose to estimate $(\beta^*, e^*)$ by solving
the following linear program

$\min_{\beta, e} \|\beta\|_1 + \|e\|_1$ s.t. $y = X\beta + \sqrt{n} e$. (7)

From a different viewpoint, in the intriguing paper [28],
Lee et al. study a general loss function model. To obtain more
flexibility in controlling the undesirable influence of the model,
they introduce a case-specific parameter vector $e \in \mathbb{R}^n$ for the
observations and modify the optimization to take this
parameter into account. Interestingly, the model turns out to
coincide with (6) when applied to the linear regression
problem with the Lasso penalty. Extensive simulations have shown
that the model (6) is considerably robust to noise. However,
no theoretical analysis is provided in the paper.

In another direction, the problem of robust Lasso under
corrupted observations is also carefully investigated by Wang
et al. [29]. In this appealing paper, instead of using the
quadratic loss function as in the Lasso, the authors propose to
employ the LAD-Lasso criterion:

$\min_{\beta} \|y - X\beta\|_1 + \sum_{j=1}^{p} \lambda_j |\beta_j|$. (8)

This optimization combines the LAD criterion and the Lasso
penalty, where the first term is designed to be robust to outliers
and the second term again promotes a sparse representation
of the estimator. However, due to the lack of a quadratic
loss that enforces the estimate to be consistent with the
observations in the $\ell_2$-norm sense, this optimization is not
guaranteed to deliver a solution with small prediction
error.

On the theoretical side, the result of [20] is asymptotic in
nature. The analysis reveals that for a class of Gaussian design
matrices with i.i.d. entries, the optimization (7) can recover
$(\beta^*, e^*)$ precisely with high probability even when the fraction
of corruption is arbitrarily close to one. However, the result
only holds under rather stringent conditions. In particular,
the authors require the number of observations $n$ to grow
proportionally with the ambient dimension $p$, and the sparsity
index $k$ to be a very small portion of $n$. These conditions are of
course far from the optimal bound in the compressed sensing (CS)
and statistics literature (recall that $k \le O(n / \log p)$ is sufficient in
conventional analysis (e.g. [30], O)).

Another line of work has also focused on the optimization
(7). In both Laska et al. [19] and Li et al. [21], the authors
establish that for a Gaussian design matrix $X$, if $n \ge C(k + s) \log p$,
where $s$ is the sparsity level of $e^*$, then the recovery
is exact. This follows from the fact that the combined
matrix $[X, I]$ obeys the restricted isometry property, a
well-known property in compressed sensing used to guarantee exact
recovery of sparse vectors via $\ell_1$-minimization. These results,
however, do not allow the fraction of corruption to come close
to unity. Also related to our paper is recent work by Studer et
al. [31], [32], in which the authors establish different results
for deterministic design matrices.

Among the previous work, the most closely related to our
current paper are recent results by Li [23] and Nguyen et al.
[22], in which a positive regularization parameter $\lambda$ is
employed to control the sparsity of $e^*$. Using different methods,
both sets of authors show that when $\lambda$ is deterministically selected
to be $1/\sqrt{\log p}$ and $X$ is a sub-orthogonal matrix, whose
columns are selected uniformly at random from the columns of an
orthogonal matrix, the solution of the following optimization
(9) is exact even when a constant fraction of the observations is corrupted.
Moreover, [23] establishes a similar result with a Gaussian
design matrix in which the number of observations is only
on the order of $k \log p$, a level that is known to be optimal
in both the CS and statistics communities.

$\min_{\beta, e} \|\beta\|_1 + \lambda \|e\|_1$ s.t. $y = X\beta + \sqrt{n} e$. (9)

Our contribution. This paper considers a general setting
in which the observations are contaminated by both sparse
and dense errors. We allow the number of corruptions to grow linearly
with the number of observations and to have arbitrarily large
magnitudes. We establish a general scaling of the quadruplet
$(n, p, k, s)$ such that the proposed extended Lasso stably
recovers both the regression and the corruption vector. Of particular
interest to us are the answers to the following questions:

(a) First, under what scalings of $(n, p, k, s)$ does the
extended Lasso obtain the unique solution with small
estimation error?

(b) Second, under what scalings of $(n, p, k)$ does the
extended Lasso obtain exact signed support recovery
even when almost all observations are corrupted?

(c) Third, under what scalings of $(n, p, k, s)$ does no solution
of the extended Lasso specifying the correct signed
support exist?

To answer the first question, we introduce a notion of
extended restricted eigenvalue for a matrix $[X, I]$, where $I$
is the identity matrix. We show that this property is satisfied
for a general class of random Gaussian design matrices. The
answers to the last two questions require stricter conditions on
the design matrix. In particular, for random Gaussian design
matrices with i.i.d. rows $\mathcal{N}(0, \Sigma)$, we rely on two standard
assumptions: invertibility and mutual incoherence. Our analysis
in this setting relies on the elegant technique introduced by
Wainwright [8].

If we denote $Z = [X, I]$, where $I$ is an identity matrix,
and $\bar{\beta} = [\beta^{*T}, e^{*T}]^T$, then the observation vector $y$ is
reformulated as $y = Z\bar{\beta} + w$, which is the same as the
standard Lasso model. However, previous results (e.g. [10],
[8]) that apply to random Gaussian design matrices are not applicable
to this setting since $Z$ no longer behaves like a Gaussian
matrix. To establish the theoretical analysis, we need a deeper
study of the interaction between the Gaussian and identity
matrices. By exploiting the fact that the matrix $Z$ consists
of two components, one of which has a special structure, our
analysis reveals an interesting phenomenon: the extended Lasso
can accurately recover both the regressor $\beta^*$ and the corruption
$e^*$ even when the fraction of corruption approaches 100%.
We measure the recoverability of these variables under two
criteria: parameter accuracy and feature selection accuracy.
Moreover, our analysis can be extended to the situation in
which the identity matrix is replaced by a tight frame $D$,
as well as to other models such as the group Lasso or
matrix Lasso with sparse error.

Notation. We summarize here some standard notation
employed throughout the paper. We reserve $T$ and $S$ for the
sparse supports of $\beta^*$ and $e^*$, respectively. Given a design
matrix $X \in \mathbb{R}^{n \times p}$ and subsets $S$ and $T$, we use $X_{ST}$ to
denote the $|S| \times |T|$ submatrix obtained by extracting those
rows indexed by $S$ and columns indexed by $T$. For a vector
$h \in \mathbb{R}^p$, we use the conventional notation for the $\ell_1$- and
$\ell_2$-norms of $h$, $\|h\|_1 = \sum_{i=1}^{p} |h_i|$ and $\|h\|_2 = (\sum_{i=1}^{p} h_i^2)^{1/2}$,
respectively. For a matrix $X \in \mathbb{R}^{n \times p}$, we denote by $\|X\|$ and
$\|X\|_\infty$ its operator norms. In particular, $\|X\|$ denotes
the spectral norm and $\|X\|_\infty$ the $\ell_\infty$ operator norm:

$\|X\|_\infty = \max_i \sum_{j=1}^{p} |X_{ij}|$.

We use the notation $C_1, C_2, c_1, c_2$, etc., to refer to positive
constants, whose values may change from line to line. Given
two functions $f$ and $g$, the notation $f(n) = O(g(n))$ means
that there exists a constant $c < +\infty$ such that $f(n) \le c\,g(n)$;
the notation $f(n) = \Omega(g(n))$ means that $f(n) \ge c\,g(n)$; and
the notation $f(n) = \Theta(g(n))$ means that $f(n) = O(g(n))$ and
$f(n) = \Omega(g(n))$. The symbol $f(n) = o(g(n))$ indicates that
$f(n)/g(n) \to 0$.
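
For readers who want to check the notation numerically, the following small sketch evaluates the vector norms, the two operator norms and a submatrix $X_{ST}$ with NumPy. The arrays and index sets are arbitrary toy values chosen for illustration.

```python
import numpy as np

h = np.array([3.0, -4.0, 0.0])
X = np.array([[1.0, -2.0, 0.5],
              [3.0, 4.0, -1.0]])

l1 = np.abs(h).sum()                   # ||h||_1 = 7
l2 = np.sqrt(np.sum(h ** 2))           # ||h||_2 = 5
spectral = np.linalg.norm(X, 2)        # ||X||: largest singular value
linf_op = np.abs(X).sum(axis=1).max()  # ||X||_inf: max absolute row sum = 8

S, T = [1], [0, 2]                     # toy row subset S and column subset T
X_ST = X[np.ix_(S, T)]                 # the |S| x |T| submatrix X_{ST}
```

The value `linf_op` agrees with `np.linalg.norm(X, np.inf)`, which NumPy defines as the maximum absolute row sum.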

Organization. The remainder of this paper is structured
as follows. Section II provides the main results, together with detailed
discussions and their consequences. Section III presents
extensive experiments that validate the theoretical results presented
in the previous section. Section IV provides the analysis of the
estimation error, whereas Sections V and VI deliver proofs
of the necessary and sufficient conditions for exact signed
support recovery. Several technical aspects of these proofs and
some well-known concentration inequalities are presented in
the Appendix. We conclude the paper in Section VII with more
discussion.

II. Main results 

In this section, we provide precise statements of the main
results of this paper. In the first subsection, we address
parameter estimation and provide a deterministic result
based on the notion of the extended restricted eigenvalue.
We further show that the random Gaussian design matrix
satisfies this property with high probability. The next
subsection considers feature selection. We establish conditions
on the design matrix such that the solution of the extended
Lasso has the exact signed supports.

A. Parameter estimation 

As in the conventional Lasso, to obtain a low parameter
estimation bound, it is necessary to impose conditions on the
design matrix $X$. In this paper, we introduce the notion of an
extended restricted eigenvalue (extended RE) condition. Let
$\mathcal{C}$ be a restricted set. We say that the matrix $X$ satisfies the
extended RE assumption over the set $\mathcal{C}$ if there exists some
$\kappa_l > 0$ such that

$\frac{1}{\sqrt{n}} \|Xh + \sqrt{n} f\|_2 \ge \kappa_l (\|h\|_2 + \|f\|_2)$ for all $(h, f) \in \mathcal{C}$, (10)

where the restricted set $\mathcal{C}$ of interest is defined with
$\lambda := \lambda_{n,e} / \lambda_{n,\beta}$ as follows:

$\mathcal{C} := \{(h, f) \in \mathbb{R}^p \times \mathbb{R}^n \mid \|h_{T^c}\|_1 + \lambda \|f_{S^c}\|_1 \le 3\|h_T\|_1 + 3\lambda \|f_S\|_1\}$. (11)

This assumption is a natural extension of the restricted
eigenvalue condition and restricted strong convexity
considered in [10], [33] and [34]. In the absence of the vector $f$
in equation (10) and in the set $\mathcal{C}$, this condition reduces
to the restricted eigenvalue condition defined in [10]. As discussed in
more detail in [10] and [35], the restricted eigenvalue condition is among
the weakest assumptions on the design matrix under which the
solution of the Lasso is consistent.
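
To make the extended RE condition concrete, the following sketch tests whether a given pair $(h, f)$ lies in the restricted set and evaluates the ratio whose infimum over that set is the extended RE constant. It assumes the reading of (10) and (11) in which the set is $\|h_{T^c}\|_1 + \lambda \|f_{S^c}\|_1 \le 3\|h_T\|_1 + 3\lambda\|f_S\|_1$ and the left side of (10) is $\|Xh + \sqrt{n} f\|_2 / \sqrt{n}$; the function names are hypothetical.

```python
import numpy as np

def _mask(size, idx):
    """Boolean mask of length `size` that is True on the index set `idx`."""
    m = np.zeros(size, dtype=bool)
    m[np.asarray(idx, dtype=int)] = True
    return m

def in_restricted_set(h, f, T, S, lam):
    """Check the cone condition (11) for the pair (h, f)."""
    t, s = _mask(h.size, T), _mask(f.size, S)
    lhs = np.abs(h[~t]).sum() + lam * np.abs(f[~s]).sum()
    rhs = 3 * np.abs(h[t]).sum() + 3 * lam * np.abs(f[s]).sum()
    return bool(lhs <= rhs)

def extended_re_ratio(X, h, f):
    """Ratio ||X h + sqrt(n) f||_2 / (sqrt(n) (||h||_2 + ||f||_2)) from (10)."""
    n = X.shape[0]
    num = np.linalg.norm(X @ h + np.sqrt(n) * f) / np.sqrt(n)
    return num / (np.linalg.norm(h) + np.linalg.norm(f))
```

Evaluating `extended_re_ratio` on sampled pairs that pass `in_restricted_set` only gives an upper estimate of the extended RE constant, since the constant is a minimum over the whole set.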

With this assumption at hand, we now state the first theorem.

Theorem 1. Consider the optimal solution $(\hat{\beta}, \hat{e})$ to the
optimization problem (6) with regularization parameters chosen
as

$\lambda_{n,\beta} \ge \frac{1}{\gamma} \frac{2\|X^* w\|_\infty}{n}$ and $\lambda_{n,e} \ge \frac{2\|w\|_\infty}{\sqrt{n}}$, (12)

where $\gamma \in (0, 1]$. Assuming that the design matrix $X$ obeys
the extended RE, the error pair $(h, f) = (\hat{\beta} - \beta^*, \hat{e} - e^*)$
is bounded by



$\|h\|_2 + \|f\|_2 \le 3\kappa_l^{-2} \max\{\lambda_{n,\beta} \sqrt{k}, \lambda_{n,e} \sqrt{s}\}$. (13)



There are several interesting observations from this theorem:

1) The error bound naturally splits into two components
related to the sparsity indices of $\beta^*$ and $e^*$. In addition,
the error bound contains three quantities: the sparsity indices,
the regularization parameters, and the extended RE constant. If the
terms related to the corruption $e^*$ are omitted, then we obtain
a parameter estimation bound similar to that of the standard Lasso

(e.g. ED, ED). 

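2) The choice of regularization parameters $\lambda_{n,\beta}$ and $\lambda_{n,e}$
can be made explicit: assuming $w$ is a Gaussian random
vector whose entries are $\mathcal{N}(0, \sigma^2)$ and the design matrix has
$\sqrt{n}$-normed columns, with high probability

$\frac{1}{n} \|X^* w\|_\infty \le 2\sqrt{\frac{\sigma^2 \log p}{n}}$ and $\frac{1}{\sqrt{n}} \|w\|_\infty \le 2\sqrt{\frac{\sigma^2 \log n}{n}}$.

Thus, it is sufficient to select

$\lambda_{n,\beta} \ge \frac{4}{\gamma} \sqrt{\frac{\sigma^2 \log p}{n}}$ and $\lambda_{n,e} \ge 4\sqrt{\frac{\sigma^2 \log n}{n}}$.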

3) At first glance, the parameter $\gamma$ does not seem to have
any meaningful interpretation, and the setting $\gamma = 1$ seems to
be the best selection since it produces the smallest estimation
error. However, this parameter actually controls the sparsity
level of the regression vector with respect to the fraction of
corruption. This relation is enforced via the restricted set $\mathcal{C}$.
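
The two ways of choosing the regularization parameters discussed in observation 2 can be sketched as follows. This assumes the reading of (12) as $\lambda_{n,\beta} \ge (2/\gamma)\|X^* w\|_\infty / n$ and $\lambda_{n,e} \ge 2\|w\|_\infty / \sqrt{n}$, taken with equality, together with the explicit Gaussian choices with constant 4; the function names are hypothetical.

```python
import numpy as np

def lambdas_from_noise(X, w, gamma=1.0):
    """Oracle choice from (12); it needs the noise vector w, so it is
    only usable in simulations where w is known."""
    n = X.shape[0]
    lam_beta = (2.0 / gamma) * np.max(np.abs(X.T @ w)) / n
    lam_e = 2.0 * np.max(np.abs(w)) / np.sqrt(n)
    return lam_beta, lam_e

def lambdas_explicit(n, p, sigma, gamma=1.0):
    """Explicit choice for Gaussian noise with standard deviation sigma
    and a design with sqrt(n)-normed columns."""
    lam_beta = (4.0 / gamma) * np.sqrt(sigma ** 2 * np.log(p) / n)
    lam_e = 4.0 * np.sqrt(sigma ** 2 * np.log(n) / n)
    return lam_beta, lam_e
```

With the explicit choice, the ratio $\lambda_{n,e} / \lambda_{n,\beta}$ equals $\gamma \sqrt{\log n / \log p}$, which shows how $\gamma$ enters the restricted set through the relative weight of the two penalties.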

In the following lemma, we show that the extended RE
condition actually holds for a large class of random Gaussian
design matrices whose rows are i.i.d. zero mean with covariance
$\Sigma$. Before stating the lemma, let us define some quantities
operating on the covariance matrix $\Sigma$: $C_{\min} := \lambda_{\min}(\Sigma)$ is
the smallest eigenvalue of $\Sigma$; $C_{\max} := \lambda_{\max}(\Sigma)$ is the largest
eigenvalue of $\Sigma$; and $\xi(\Sigma) := \max_i \Sigma_{ii}$ is the maximal entry
on the diagonal of the matrix $\Sigma$.

Lemma 1. Consider the random Gaussian design matrix
whose rows are i.i.d. $\mathcal{N}(0, \Sigma)$ and assume $C_{\max}\, \xi(\Sigma) = O(1)$.
Select



X 



7 



'logn 



(14)
Then with probability greater than $1 - c_1 \exp(-c_2 n)$,
the matrix $X$ satisfies the extended RE with parameter
Ki = provided that n > C^p^-klogp and s < 

min |c*i ^ 2 |" e „ , ^2^1 /or some sma/Z constants C%, C 2 . 

We would like to offer a few remarks:

1) The choice of the parameter $\lambda$ is nothing special here.
When the design matrix is Gaussian and independent of
the Gaussian stochastic noise $w$, we can easily show that
$\frac{1}{n} \|X^* w\|_\infty \le 2\sqrt{\xi(\Sigma) \sigma^2 \log p / n}$ with probability at least
$1 - 2\exp(-\log p)$. Therefore, the selection of $\lambda$ follows from
Theorem 1.

2) The proof of this lemma, shown in the Appendix, boils
down to controlling two terms:

• Restricted eigenvalue with $X$.

$\frac{1}{n} \|Xh\|_2^2 + \|f\|_2^2 \ge \kappa_r (\|h\|_2 + \|f\|_2)^2$ for all $(h, f) \in \mathcal{C}$.

• Mutual incoherence. The column space of the matrix $X$ is
incoherent with the column space of the identity matrix.
That is, there exists some $\kappa_m > 0$ such that

$\frac{1}{\sqrt{n}} |\langle Xh, f \rangle| \le \kappa_m (\|h\|_2 + \|f\|_2)^2$ for all $(h, f) \in \mathcal{C}$.

If the incoherence between these two column spaces is
sufficiently small such that $4\kappa_m < \kappa_r$, then, expanding
$\frac{1}{n}\|Xh + \sqrt{n} f\|_2^2 = \frac{1}{n}\|Xh\|_2^2 + \|f\|_2^2 + \frac{2}{\sqrt{n}} \langle Xh, f \rangle$, we can conclude that
$\frac{1}{n}\|Xh + \sqrt{n} f\|_2^2 \ge (\kappa_r - 2\kappa_m)(\|h\|_2 + \|f\|_2)^2$. The small mutual
incoherence property is especially important since it describes
how the regression separates itself from the sparse error.

3) To simplify our result, we consider a special case of
the uniform Gaussian design, in which $\Sigma = I_{p \times p}$. In this
situation, $C_{\min} = C_{\max} = \xi(\Sigma) = 1$. We have the following
result, which is a corollary of Theorem 1 and Lemma 1.

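Corollary 1 (Standard Gaussian design). Let $X$ be a standard
Gaussian design matrix. Consider the optimal solution $(\hat{\beta}, \hat{e})$
to the optimization problem (6) with regularization parameters
chosen as

$\lambda_{n,\beta} = \frac{4}{\gamma} \sqrt{\frac{\sigma^2 \log p}{n}}$ and $\lambda_{n,e} = 4\sqrt{\frac{\sigma^2 \log n}{n}}$ (15)

for $\gamma \in (0, 1]$. Also assume that $n \ge C k \log p$ and
$s \le \min\{C_1 \frac{n}{\gamma^2 \log n}, C_2 n\}$ for some small constants $C, C_1, C_2$.
Then with probability greater than $1 - c_1 \exp(-c_2 n)$, the error
pair $(h, f) = (\hat{\beta} - \beta^*, \hat{e} - e^*)$ is bounded by

$\|h\|_2 + \|f\|_2 \le 384 \max\left\{ \frac{1}{\gamma} \sqrt{\frac{\sigma^2 k \log p}{n}}, \sqrt{\frac{\sigma^2 s \log n}{n}} \right\}$. (16)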

Corollary 1 reveals a remarkable result: by setting
$\gamma = 1/\sqrt{\log n}$, even when the fraction of corruption is linearly
proportional to the number of samples $n$, the extended Lasso
(6) is still capable of recovering both the coefficient vector $\beta^*$
and the corruption (missing data) vector $e^*$ within the bounded error
(16). Without the dense noise $w$ in the observation model
(3) ($\sigma = 0$), the extended Lasso actually recovers the exact
solution. This is impossible to achieve with the standard Lasso.
Furthermore, if we know a priori that the number of corrupted
observations is on the order of $O(n / \log p)$, then selecting
$\gamma = 1$ instead of $1/\sqrt{\log n}$ will minimize the estimation error
(see equation (16) of Corollary 1).



B. Feature selection with random Gaussian design 

In many applications, the feature selection criterion is
preferred [8], [5]. Feature selection refers to the property that
the recovered parameter has the same signed support as the
true regressor. In general, good feature selection implies good
parameter estimation but the reverse direction does not usually
hold. In this part, we investigate conditions on the design
matrix and the scaling of $(n, p, k, s)$ such that both the regression
and sparse error vectors satisfy this criterion.

Consider the linear model (3) where $X$ is the Gaussian
random design matrix whose rows are i.i.d. zero mean with
covariance matrix $\Sigma$. It is well known for the Lasso that
in order to obtain feature selection accuracy, the covariance
matrix $\Sigma$ must obey two properties: invertibility and small
mutual incoherence restricted to the set $T$. The first property
guarantees that (6) is strictly convex, leading to a unique
solution of the convex program, while the second property
requires that the dependence between two components of $\Sigma$, one
related to the set $T$ and the other to the set $T^c$, be
sufficiently small.

1) Invertibility. To guarantee uniqueness, we require $\Sigma_{TT}$
to be invertible. In particular, letting $C_{\min} = \lambda_{\min}(\Sigma_{TT})$, we
require $C_{\min} > 0$.

2) Mutual incoherence. For some $\gamma \in (0, 1)$,

$\|\Sigma_{T^c T} (\Sigma_{TT})^{-1}\|_\infty \le (1 - \gamma)$. (17)

It is worth noting that these two properties, invertibility and mutual
incoherence, are exactly the same as the conditions
used to establish exact signed support recovery
in the standard Lasso (e.g. [16], 0, 0).

Toward this end, we will also elaborate on three other
quantities operating on the restricted covariance matrix $\Sigma_{TT}$:
$C_{\max}$, which is defined as the maximum eigenvalue of $\Sigma_{TT}$,
$C_{\max} := \lambda_{\max}(\Sigma_{TT})$; and $D^-_{\max}$ and $D^+_{\max}$, which
denote the $\ell_\infty$-norms of the matrices $(\Sigma_{TT})^{-1}$ and $\Sigma_{TT}$:
$D^-_{\max} := \|(\Sigma_{TT})^{-1}\|_\infty$ and $D^+_{\max} := \|\Sigma_{TT}\|_\infty$.

Our result also involves two other quantities operating on
the conditional covariance matrix of $(X_{T^c} \mid X_T)$, defined as

$\Sigma_{T^c|T} := \Sigma_{T^c T^c} - \Sigma_{T^c T} (\Sigma_{TT})^{-1} \Sigma_{T T^c}$. (18)

They are defined as $\rho_u(\Sigma_{T^c|T}) = \max_i (\Sigma_{T^c|T})_{ii}$ and
$\rho_l(\Sigma_{T^c|T}) = \frac{1}{2} \min_{i \ne j} [(\Sigma_{T^c|T})_{ii} + (\Sigma_{T^c|T})_{jj} - 2(\Sigma_{T^c|T})_{ij}]$,
which we often denote with the shorthand notation $\rho_u$ and $\rho_l$.
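
The covariance quantities introduced in this subsection are straightforward to compute for a given $\Sigma$ and support $T$. The sketch below assumes the definitions as read here: $C_{\min}$ and $C_{\max}$ as extreme eigenvalues of $\Sigma_{TT}$, $D^{\pm}_{\max}$ as $\ell_\infty$ operator norms, the incoherence quantity $\|\Sigma_{T^c T}(\Sigma_{TT})^{-1}\|_\infty$ from (17), and $\rho_u$, $\rho_l$ from the conditional covariance (18). The function name and dictionary keys are hypothetical, and the complement of $T$ is assumed to contain at least two indices.

```python
import numpy as np

def covariance_quantities(Sigma, T):
    """Compute the quantities of Section II-B for covariance Sigma and support T."""
    p = Sigma.shape[0]
    T = np.asarray(T, dtype=int)
    Tc = np.setdiff1d(np.arange(p), T)
    S_TT = Sigma[np.ix_(T, T)]
    S_TcT = Sigma[np.ix_(Tc, T)]
    S_TcTc = Sigma[np.ix_(Tc, Tc)]
    S_TT_inv = np.linalg.inv(S_TT)

    def op_inf(A):
        # l_inf operator norm: maximum absolute row sum
        return np.abs(A).sum(axis=1).max()

    eig = np.linalg.eigvalsh(S_TT)              # eigenvalues in ascending order
    cond = S_TcTc - S_TcT @ S_TT_inv @ S_TcT.T  # conditional covariance (18)
    d = np.diag(cond)
    pair = d[:, None] + d[None, :] - 2 * cond   # entries for each pair (i, j)
    off_diag = ~np.eye(len(Tc), dtype=bool)
    return {
        "C_min": eig[0],
        "C_max": eig[-1],
        "D_minus": op_inf(S_TT_inv),
        "D_plus": op_inf(S_TT),
        "incoherence": op_inf(S_TcT @ S_TT_inv),  # compare with 1 - gamma in (17)
        "rho_u": d.max(),
        "rho_l": 0.5 * pair[off_diag].min(),
    }
```

For the identity covariance every quantity except the incoherence equals one, and the incoherence is zero.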

We establish the following result for Gaussian random
designs whose covariance matrix $\Sigma$ obeys the two assumptions.

Theorem 2 (Achievability). Given the linear model (5) with
random Gaussian design and the covariance matrix $\Sigma$
satisfying the invertibility and incoherence properties for any
$\gamma \in (0, 1)$, suppose that we solve the extended Lasso (6) with
regularization parameters obeying



A n ,/3 — 



7 



8 /CT 2 77 lognlogp 



max{p u , L> m ax} (19) 



o~ 2 log n 



(20) 



6 



for some rj £ [ t n , 1). Assume that the sequence (n, p, k, s) 
and regularization parameters \ n ,p, A„ je satisfying s < rjn 
and n > max{n 1 ,n 2 } where ri\ and n 2 are defined as 



4(1 +e) p u , 1 , .J 9 , , 2 fT 2 C min 



\2 u 

A n,f) K 



and n 2 := 48(1 + e) 



77 max{p M ,D+ ax } 

(1 - V) 2 c miDl i 

-2 



/ 2ovTog^ Hog(p-fc)logn 

V K,eV n J 

for e £ (0,1)- In addition, suppose that minigr |/3*| > 
fpi^n.fi) and min ieS |e*| > f e (K,p, A„, e ) w/?ere 



/a := ciA' « + 20* 



/e C2A' B\[Cu 



with X' n> p := \ n .j3\ 



' a 2 log fc 

Cmin(« - S) 



s/c + kv sk 



and (21) 



c 3 A r , 



(22) 



' k \og(p — k) 
(1 — rj) 2 n 



i-l/2 



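Then, the following properties hold with probability greater
than $1 - c \exp(-c' \max\{\log n, \log(p - k)\})$:

1) The solution pair $(\hat{\beta}, \hat{e})$ of the extended Lasso (6) is
unique and has the exact signed support.

2) $\ell_\infty$-norm bounds: $\|\hat{\beta} - \beta^*\|_\infty \le f_\beta(\lambda_{n,\beta})$ and
$\|\hat{e} - e^*\|_\infty \le f_e(\lambda_{n,\beta}, \lambda_{n,e})$.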

There are several interesting observations from the theorem.

1) The first important observation is that the extended Lasso
is robust to arbitrarily large and sparse observation errors.
Under the same invertibility and mutual incoherence
assumptions on the covariance matrix $\Sigma$ as for the standard Lasso, the
extended Lasso program can recover both the regression vector
and the error with exact signed supports even when almost all
the observations are contaminated by arbitrarily large errors
with unknown support. What we sacrifice for the corruption
robustness is an additional log factor in the number of samples.
We notice that when the number of errors is $O(n / \log n)$, only
$\Omega(k \log(p - k))$ samples are sufficient to recover the exact
signed supports of both the regression and sparse error vectors.

2) We consider the special case of a Gaussian random
design in which the covariance matrix $\Sigma = I_{p \times p}$. In this
case, the entries of $X$ are i.i.d. $\mathcal{N}(0, 1)$ and we have

$C_{\min} = C_{\max} = D^+_{\max} = D^-_{\max} = \rho_u = \rho_l = 1$. In

addition, the invertibility and mutual incoherence properties
are automatically satisfied with $\gamma = 1$. The theorem implies
that when the number of errors $s$ is arbitrarily close to $n$,
the number of samples $n$ needed to recover the exact signed
supports obeys $\frac{n}{\log n} = \Omega(k \log(p - k))$. Furthermore, Theorem
2 guarantees consistency in the element-wise $\ell_\infty$-norm of the
estimated regression at the rate of



= 



er 2 logp rjk log n log(p — k) 



As $\eta$ is chosen to be $1/\sqrt{\log n}$ (equivalent to taking
$s$ close to $n / \log n$), the $\ell_\infty$ error rate is on the order of

$O(\sigma \sqrt{\log p / n})$, which is known to be the same as that of
the standard Lasso. On the other hand, if we select $\eta$
arbitrarily close to unity, or equivalently $s$ close to $n$, the

error rate is on the order of 0{a^^^v). This is 
naturally interpreted as the more fraction of corruption is on 
the observations, the higher reconstruction error we expect to 
get. What interesting is that we draw an explicit connection 
between the fraction of corruption and the reconstruction error 
obtained by the extended Lasso optimization. 

3) Corollary 1, though interesting, is not able to guarantee stable recovery when the fraction of corruption converges to unity. We show in Theorem 2 that this fraction can come arbitrarily close to unity by sacrificing a factor of $\log n$ in the number of samples. Theorem 2 also implies that there is a significant difference between recovery with small parameter estimation error and recovery with correct variable selection. When the number of corrupted observations is linearly proportional to $n$, recovering the exact signed supports requires an increase from $\Omega(k\log p)$ samples (in Corollary 1) to $\Omega(k\log p\log n)$ samples (in Theorem 2). Similar behavior is exhibited by the standard Lasso, as pointed out in the discussion after Corollary 2 of [8].

Our next theorem shows that the number of samples needed to accurately recover the signed support is actually optimal. That is, whenever the rescaled sample size falls below a certain threshold, regardless of how the regularization parameters $\lambda_{n,\beta}$ and $\lambda_{n,e}$ are selected, no solution of the extended Lasso can correctly identify the signed supports with high probability.

Theorem 3 (Inachievability). Consider the linear model (3) with random Gaussian design and the covariance matrix $\Sigma$ satisfying the invertibility and incoherence properties for any $\gamma \in (0,1)$. Let $\eta, \delta \in (0,1)$ and let the sequence $(n,p,k,s)$ satisfy $s \ge \eta n$ and $n < \max\{n_1, n_2\}$, where $n_1$ and $n_2$ are defined as



"1 



n 2 := 



2(1-6) Pl klog( P ~k) J 3 

(1 - V) Onax(2 " 7 ) 2 I 8 



(l-^) : 



■ <T 2 C n 



A 2 k 



(1 ~ 8) V Pi 
12 (1 - t?) 2 C ma 



1 | 2^1^ 



k log(n — s) log(p — k). 



Then, with probability tending to unity, no solution pair of the extended Lasso has the correct signed support.

Consider the case where the covariance matrix of the design matrix $X$ is $\Sigma = I_{p\times p}$, or equivalently, the entries of $X$ are i.i.d. Gaussian $\mathcal{N}(0,1)$. In addition, assume the regularization parameters $\lambda_{n,\beta}$ and $\lambda_{n,e}$ are chosen from the families of (19) and (20),



respectively. That is, A 



/3,n 



= 8 



2 rj log n log p 



and A fi .„ = 



77. 



4a / a2 1 ° 6 " . Then, the theorem implies that the extended Lasso 
dTb is not able to achieve the correct signed support solution 



7 



whenever the number of observations is less than 

1 



Sublinear sparsity 



n < max < c\ 



1 



C'2 



(1-T))' 



V 

klog(p 



k\og(p-k), 

k) log(l — rf)n 



III. Illustrative simulations 

In this section, we provide several simulations to illustrate the capability of the extended Lasso in recovering the exact regression signed support when a significant fraction of observations is corrupted by large errors. Simulations are performed for a range of parameters $(n,p,k,s)$ where the design matrix $X$ is a standard Gaussian random matrix whose rows are i.i.d. $\mathcal{N}(0, I_{p\times p})$. For each fixed set of $(n,p,k,s)$, we generate sparse vectors $\beta^*$ and $e^*$ where the locations of the nonzero entries are distributed uniformly at random and their magnitudes are Gaussian distributed.

In our experiments, we consider varying problem sizes $p \in \{128, 256, 512\}$ and three types of regression sparsity indices: sublinear sparsity ($k = 0.2p/\log(0.2p)$), linear sparsity ($k = 0.1p$) and fractional power sparsity ($k = 0.5p^{0.75}$). In all cases, we fix the error support size at $s = n/2$, which means that half of the observations are corrupted. With this selection, Theorem 2 suggests that we require the number of samples to satisfy $n \ge 2Ck\log(p-k)\log n$ to guarantee exact signed support recovery. We choose $n/\log n = 4\theta k\log(p-k)$, where the parameter $\theta$ is the rescaled sample size. This parameter controls the success or failure of the extended Lasso.
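The problem sizes described above can be tabulated with a short script. This is only a sketch: rounding $k$ up to an integer and reading the rescaled sample size through $n/\log n = 4\theta k\log(p-k)$ are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

def sparsity_levels(p):
    """Three sparsity indices k for dimension p (rounded up, an assumption)."""
    return {
        "sublinear": math.ceil(0.2 * p / math.log(0.2 * p)),
        "linear": math.ceil(0.1 * p),
        "fractional_power": math.ceil(0.5 * p ** 0.75),
    }

def sample_size(theta, k, p):
    """Smallest n >= 3 with n / log(n) >= 4 * theta * k * log(p - k)."""
    target = 4.0 * theta * k * math.log(p - k)
    n = 3  # n / log(n) is increasing for n >= 3
    while n / math.log(n) < target:
        n += 1
    return n

if __name__ == "__main__":
    for p in (128, 256, 512):
        for name, k in sparsity_levels(p).items():
            n = sample_size(1.0, k, p)
            print(p, name, k, n, n // 2)  # last column: s = n / 2 corrupted rows
```

Sweeping `theta` over a grid and recording the empirical success rate at each value would reproduce the kind of curve the section describes.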

In the algorithm, we select $\lambda_{n,\beta} = 2\sqrt{\sigma^2 \log p \log n / n}$ and $\lambda_{n,e} = 2\sqrt{\sigma^2 \log n / n}$, as suggested by Theorem 2, where the noise level $\sigma = 0.1$ is fixed. The algorithm reports a success if the solution pair has the same signed support as $(\beta^*, e^*)$. In Fig. 1, each point on the curve represents the average of 100 trials.
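The paper does not say which solver was used, so the following is a minimal sketch rather than the authors' implementation. It assumes the objective $\frac{1}{2n}\|y - X\beta - \sqrt{n}e\|_2^2 + \lambda_{n,\beta}\|\beta\|_1 + \lambda_{n,e}\|e\|_1$ from the proof of Theorem 1 and minimizes it with plain proximal gradient (ISTA) steps; the function names and the support tolerance `tol` are illustrative choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, y, beta, e, lam_beta, lam_e):
    """Extended Lasso objective with the sqrt(n)-scaled error term."""
    n = X.shape[0]
    r = y - X @ beta - np.sqrt(n) * e
    return r @ r / (2 * n) + lam_beta * np.abs(beta).sum() + lam_e * np.abs(e).sum()

def extended_lasso(X, y, lam_beta, lam_e, n_iter=5000):
    """Proximal gradient iterations on the joint variable (beta, e)."""
    n, p = X.shape
    sqrt_n = np.sqrt(n)
    # Lipschitz constant of the smooth part: ||[X, sqrt(n) I]||_2^2 / n.
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 1.0)
    beta, e = np.zeros(p), np.zeros(n)
    for _ in range(n_iter):
        r = y - X @ beta - sqrt_n * e  # residual at the current iterate
        beta = soft_threshold(beta + step * (X.T @ r) / n, step * lam_beta)
        e = soft_threshold(e + step * r / sqrt_n, step * lam_e)
    return beta, e

def same_signed_support(est, truth, tol=1e-6):
    """True when est and truth agree in sign on every coordinate (zeros included)."""
    s_est = np.sign(np.where(np.abs(est) > tol, est, 0.0))
    s_true = np.sign(np.where(np.abs(truth) > tol, truth, 0.0))
    return bool(np.array_equal(s_est, s_true))
```

A trial would count as a success when `same_signed_support` holds for both the regression estimate and the error estimate; averaging that indicator over repeated draws gives one point of a success-probability curve.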

As demonstrated by the simulation results, our extended Lasso is capable of recovering the exact signed support of both $\beta^*$ and $e^*$ even when 50% of the observations are contaminated. Furthermore, up to unknown constants, Theorems 2 and 3 match the simulation results. As the rescaled sample size $n/\log n$ falls below $2k\log(p-k)$, the probability of success starts dropping to zero, implying the failure of the extended Lasso.

Fig. 1. Probability of success in recovering the signed supports.



Moreover, it is clear that 



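IV. Proof of Theorem 1 and related results

Proof of Theorem 1. Since $(\hat\beta, \hat e)$ is an optimal solution pair of (5), we have

$$\frac{1}{2n}\|y - X\hat\beta - \sqrt{n}\hat e\|_2^2 + \lambda_{n,\beta}\|\hat\beta\|_1 + \lambda_{n,e}\|\hat e\|_1 \le \frac{1}{2n}\|y - X\beta^* - \sqrt{n}e^*\|_2^2 + \lambda_{n,\beta}\|\beta^*\|_1 + \lambda_{n,e}\|e^*\|_1. \qquad (23)$$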



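From $h = \hat\beta - \beta^*$ and $f = \hat e - e^*$, we can easily see that

$$\|y - X\hat\beta - \sqrt{n}\hat e\|_2^2 = \|y - X\beta^* - \sqrt{n}e^*\|_2^2 - 2\langle w, Xh + \sqrt{n}f\rangle + \|Xh + \sqrt{n}f\|_2^2$$

and

$$\|\beta^*\|_1 - \|\hat\beta\|_1 = \|\beta^*\|_1 - \|\beta^* + h\|_1 \le \|h_T\|_1 - \|h_{T^c}\|_1.$$

We also have a similar bound with $e$:

$$\|e^*\|_1 - \|\hat e\|_1 \le \|f_S\|_1 - \|f_{S^c}\|_1.$$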
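The two facts used here, the expansion of the squared residual and the $\ell_1$ bound on the support $T$, can be checked numerically. The sketch below assumes the model $y = X\beta^* + \sqrt{n}e^* + w$ implied by (23); the perturbation used as a stand-in for $(\hat\beta, \hat e)$ is arbitrary, since both facts hold for any $h$ and $f$.

```python
import numpy as np

def check_decomposition(seed=0, n=40, p=60, k=5, s=8):
    """Return (lhs, rhs, l1_gap, l1_bound) for a random instance."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_star = np.zeros(p)
    beta_star[:k] = rng.standard_normal(k)      # support T = {0, ..., k-1}
    e_star = np.zeros(n)
    e_star[:s] = 10.0 * rng.standard_normal(s)  # support S = {0, ..., s-1}
    w = 0.1 * rng.standard_normal(n)
    y = X @ beta_star + np.sqrt(n) * e_star + w

    # Arbitrary stand-ins for the estimates; h and f are the error vectors.
    beta_hat = beta_star + 0.3 * rng.standard_normal(p)
    e_hat = e_star + 0.3 * rng.standard_normal(n)
    h, f = beta_hat - beta_star, e_hat - e_star

    u = X @ h + np.sqrt(n) * f
    lhs = np.sum((y - X @ beta_hat - np.sqrt(n) * e_hat) ** 2)
    rhs = np.sum(w ** 2) - 2.0 * (w @ u) + np.sum(u ** 2)

    l1_gap = np.abs(beta_star).sum() - np.abs(beta_hat).sum()
    l1_bound = np.abs(h[:k]).sum() - np.abs(h[k:]).sum()
    return lhs, rhs, l1_gap, l1_bound

if __name__ == "__main__":
    print(check_decomposition())
```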






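Putting these pieces into (23), we can bound

$$\frac{1}{2n}\|Xh + \sqrt{n}f\|_2^2 \le \frac{1}{n}\langle w, Xh + \sqrt{n}f\rangle + \lambda_{n,\beta}(\|h_T\|_1 - \|h_{T^c}\|_1) + \lambda_{n,e}(\|f_S\|_1 - \|f_{S^c}\|_1)$$

$$\le \frac{1}{n}\|X^*w\|_\infty\|h\|_1 + \frac{1}{\sqrt{n}}\|w\|_\infty\|f\|_1 + \lambda_{n,\beta}(\|h_T\|_1 - \|h_{T^c}\|_1) + \lambda_{n,e}(\|f_S\|_1 - \|f_{S^c}\|_1)$$

$$= \Big(\frac{1}{n}\|X^*w\|_\infty + \lambda_{n,\beta}\Big)\|h_T\|_1 - \Big(\lambda_{n,\beta} - \frac{1}{n}\|X^*w\|_\infty\Big)\|h_{T^c}\|_1 + \Big(\frac{1}{\sqrt{n}}\|w\|_\infty + \lambda_{n,e}\Big)\|f_S\|_1 - \Big(\lambda_{n,e} - \frac{1}{\sqrt{n}}\|w\|_\infty\Big)\|f_{S^c}\|_1. \qquad (24)$$

By the choices of $\lambda_{n,\beta}$ and $\lambda_{n,e}$ in the lemma, we have

$$\frac{1}{n}\|X^*w\|_\infty \le \frac{1}{2}\lambda_{n,\beta} \quad\text{and}\quad \frac{1}{\sqrt{n}}\|w\|_\infty \le \frac{1}{2}\lambda_{n,e}.$$

Therefore

$$\frac{1}{2n}\|Xh + \sqrt{n}f\|_2^2 \le \frac{3}{2}\lambda_{n,\beta}\|h_T\|_1 - \frac{1}{2}\lambda_{n,\beta}\|h_{T^c}\|_1 + \frac{3}{2}\lambda_{n,e}\|f_S\|_1 - \frac{1}{2}\lambda_{n,e}\|f_{S^c}\|_1.$$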

The left-hand side is nonnegative, thus the error pair $(h, f)$ belongs to the set $\mathcal{C}$ defined above. Hence, by the extended RE,

$$\kappa_l^2(\|h\|_2 + \|f\|_2)^2 \le 3\lambda_{n,\beta}\|h_T\|_1 + 3\lambda_{n,e}\|f_S\|_1 \le 3\lambda_{n,\beta}\sqrt{k}\|h\|_2 + 3\lambda_{n,e}\sqrt{s}\|f\|_2,$$

where the last inequality follows from the crude $\ell_1/\ell_2$ bound $\|h_T\|_1 \le \sqrt{k}\|h\|_2$. If $\lambda\sqrt{s/k} \le 1$, the right-hand side is upper bounded by $3\lambda_{n,\beta}\sqrt{k}(\|h\|_2 + \|f\|_2)$. On the other hand, it is upper bounded by $3\lambda_{n,e}\sqrt{s}(\|h\|_2 + \|f\|_2)$ if $\lambda\sqrt{s/k} \ge 1$. Combining these pieces together, we conclude

$$\|h\|_2 + \|f\|_2 \le 3\kappa_l^{-2}\max\{\lambda_{n,\beta}\sqrt{k}, \lambda_{n,e}\sqrt{s}\},$$

which completes our proof. □



Proof of Lemma 1. Decompose $\frac{1}{n}\|Xh + \sqrt{n}f\|_2^2 = \frac{1}{n}\|Xh\|_2^2 + \|f\|_2^2 + \frac{2}{\sqrt{n}}\langle Xh, f\rangle$. In order to lower bound the left-hand side, our main tool is to control each term on the right-hand side.

To establish a lower bound of $\frac{1}{n}\|Xh\|_2^2$, we leverage an appealing result of [33]. This result states that for any Gaussian random matrix $X$ with i.i.d. $\mathcal{N}(0, \Sigma)$ rows, there exist universal positive constants $c_1, c_2$ such that the following inequality holds with probability greater than $1 - c_1\exp(-c_2 n)$:

$$\frac{1}{\sqrt{n}}\|Xv\|_2 \ge \frac{1}{4}\|\Sigma^{1/2}v\|_2 - 9\sqrt{\xi(\Sigma)}\sqrt{\frac{\log p}{n}}\,\|v\|_1 \qquad (25)$$

for all $v \in \mathbb{R}^p$. Here, we remind the reader of the notation $\xi(\Sigma) = \max_j \Sigma_{jj}$ and $C_{\min} = \lambda_{\min}(\Sigma)$.

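We now apply this inequality to the error vector $h$ in the set $\mathcal{C}$. Since $(h, f) \in \mathcal{C}$, we have

$$\|h\|_1 \le 4\|h_T\|_1 + 3\lambda\|f_S\|_1 \le 4\sqrt{k}\|h\|_2 + 3\lambda\sqrt{s}\|f\|_2.$$

Next, taking advantage of (25) together with $\|\Sigma^{1/2}h\|_2 \ge \sqrt{C_{\min}}\|h\|_2$ yields

$$\frac{1}{\sqrt{n}}\|Xh\|_2 \ge \Big[\frac{1}{4}\sqrt{C_{\min}} - 36\sqrt{\frac{\xi k\log p}{n}}\Big]\|h\|_2 - 27\lambda\sqrt{\frac{\xi s\log p}{n}}\,\|f\|_2,$$

where we denote the shorthand notation $\xi := \xi(\Sigma)$. This inequality leads to

$$\frac{1}{\sqrt{n}}\|Xh\|_2 + \|f\|_2 \ge \Big[\frac{1}{4}\sqrt{C_{\min}} - 36\sqrt{\frac{\xi k\log p}{n}}\Big]\|h\|_2 + \Big[1 - 27\lambda\sqrt{\frac{\xi s\log p}{n}}\Big]\|f\|_2.$$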

From the assumptions of the lemma and the choice of A in 
( 14 1, the two quantities in the brackets are strictly greater than 
0. Thus, -±= \\Xh\\ 2 + ||/|| 2 > i(\\h\\ 2 + ||/|| 2 ); or equivalent 

l\\xh\\l + \\f\\l >§(|W|;j + ]|/||*). 

In combination with the following lemma [2] we conclude that 

' * ^/|| 2 >^(IHI 2 + II/II 2 ), 



\Xh- 



as claimed. 



□ 



Lemma 2. Consider the random Gaussian design matrix X 
whose rows are i.i.d. A/"(0, E). Assume that n 2 C max £(E) = 
0(1). Suppose that s < C\ ^ v n and n > C 2 fclogp, then 
the following inequality holds with probability greater than 
1 — exp(— cn) 

^\(Xh,f)\<±(\\h\\l + \\f\\l)- 



Proof. Divide the index set $\{1, \dots, p\}$ into subsets $T_1, T_2, \dots, T_q$ of size $k$ such that the first set $T_1$ contains the $k$ entries of $h$ indexed by $T$, the set $T_2$ contains the $k$ largest absolute entries of the vector $h_{T^c}$, $T_3$ contains the second $k$ largest absolute entries of $h_{T^c}$, and so on. By the same strategy, we also divide the index set $\{1, \dots, n\}$ into subsets $S_1, S_2, \dots, S_r$ such that the first set $S_1$ contains the $s$ entries of $f$ indexed by $S$ and the sets $S_2, S_3, \dots$ are of size $s' \ge s$.
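The block partition just described can be sketched in a few lines. The function name `shell_partition` is hypothetical; the code sorts the off-support entries by decreasing magnitude and cuts them into blocks of size $k$, and it can be used to check the standard bound $\sum_{i \ge 3}\|h_{T_i}\|_2 \le k^{-1/2}\|h_{T^c}\|_1$ invoked in the proof.

```python
import numpy as np

def shell_partition(h, T, k):
    """Return [T_1, T_2, ...]: T_1 = T, then blocks of T^c by decreasing |h|."""
    T = np.asarray(T)
    Tc = np.setdiff1d(np.arange(h.size), T)
    # Indices of T^c ordered from largest to smallest absolute entry.
    order = Tc[np.argsort(-np.abs(h[Tc]), kind="stable")]
    return [T] + [order[i:i + k] for i in range(0, order.size, k)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    h, k = rng.standard_normal(50), 5
    blocks = shell_partition(h, np.arange(k), k)
    tail = sum(np.linalg.norm(h[b]) for b in blocks[2:])
    bound = np.abs(h[k:]).sum() / np.sqrt(k)
    print(tail, bound)  # the tail sum should not exceed the bound
```

The bound holds because every entry in block $T_i$ ($i \ge 3$) is no larger in magnitude than the average magnitude over $T_{i-1}$.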

We now have 



< max 

ij 



in 
1 



X 



SiTi 



El 



1/5 



Notice that the matrix Xg.^ is the random Gaussian matrix 
whose rows are Af(0, Et t )- By the random Gaussian matrix 
concentration in Lemma [14] in Appendix VIII-D we have with 
probability greater than 1 — 2 exp(— r 2 n/2), 



< 



-1I/2 




Taking the union bound over all possibility of Tj and Si, 
we have this inequality holds with probability at least 1 — 

2 (") (fc) exp(-r 2 n/2). Assuming that n > c^ 1 klog(p/k), 
we have (?) < < exp(cin). In addition, assuming 






n > c^ l s'\og{n/s'), we have (™) < (f?) s < exp(c 2 n). 
Therefore, with sufficiently small constants c\ and c 2 , we get 



max^z H^T, II < \/G 
13 Jn 




with probability at least 1 — exp(— (r 2 /2 — c\ — c 2 )n) where 
we recall the definition of C max :— A max (£). 

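A standard bound in [36] gives us $\sum_{i \ge 3}\|h_{T_i}\|_2 \le k^{-1/2}\|h_{T^c}\|_1$. In addition, since $h$ belongs to the set $\mathcal{C}$, $\|h_{T^c}\|_1 \le 3\sqrt{k}\|h\|_2 + 3\lambda\sqrt{s}\|f\|_2$. Hence,

$$\sum_{i=1}^{q}\|h_{T_i}\|_2 \le 2\|h\|_2 + \sum_{i \ge 3}\|h_{T_i}\|_2 \le 5\|h\|_2 + 3\lambda\sqrt{\frac{s}{k}}\,\|f\|_2.$$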



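Similar manipulations along with the choice of $s' \ge s$ also yield

$$\sum_{i \ge 3}\|f_{S_i}\|_2 \le s'^{-1/2}\|f_{S^c}\|_1 \le \frac{3}{\lambda}\sqrt{\frac{k}{s'}}\,\|h\|_2 + 3\|f\|_2,$$

leading to

$$\sum_{i=1}^{r}\|f_{S_i}\|_2 \le \frac{3}{\lambda}\sqrt{\frac{k}{s'}}\,\|h\|_2 + 5\|f\|_2.$$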

Hence, (X/i, /) | is upper bounded by 

£.1/2 



x [5||% + 3A A /-||/|| 2 



x max < A 




7 11% + 5 11/11 



7 }(\\h\\ 2 + \\f\\ 2 y 



We select s' = C ^2°^ n with an appropriate con- 
stant C. From the assumption that C max £(£) = 0(1) 
and a few algebraic manipulations, we can show that 



25Cmax max i A 



fc ' A V s 



7 f < c-i=. Therefore, 



= |<Xfc,/)|<c 




+ H (IN 2 + II/II 2 ) 2 



for sufficiently small r and n > Cs'. 



□ 



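V. Proof of Theorem 2 - Achievability

By the KKT conditions, $(\hat\beta, \hat e)$ is a solution pair of the extended Lasso if and only if the following set of equations is satisfied:

$$-\frac{1}{n}X^*(y - X\hat\beta - \sqrt{n}\hat e) + \lambda_{n,\beta}z^{(\beta)} = 0, \qquad (26)$$

$$-\frac{1}{\sqrt{n}}(y - X\hat\beta - \sqrt{n}\hat e) + \lambda_{n,e}z^{(e)} = 0, \qquad (27)$$

where $z^{(\beta)}$ and $z^{(e)}$ are elements of the subgradients of the $\ell_1$ norm evaluated at $\hat\beta$ and $\hat e$, respectively. It has been well established that $(\hat\beta, \hat e)$ is the unique solution to the extended Lasso program if

$$\frac{1}{n}X_i^*(y - X\hat\beta - \sqrt{n}\hat e) = \lambda_{n,\beta}\,\mathrm{sgn}(\hat\beta_i) \ \text{ for } \hat\beta_i \ne 0, \qquad \frac{1}{n\lambda_{n,\beta}}\big|X_i^*(y - X\hat\beta - \sqrt{n}\hat e)\big| < 1 \ \text{ for } \hat\beta_i = 0, \qquad (28)$$

and

$$\frac{1}{\sqrt{n}}(y_i - x_i\hat\beta - \sqrt{n}\hat e_i) = \lambda_{n,e}\,\mathrm{sgn}(\hat e_i) \ \text{ for } \hat e_i \ne 0, \qquad \frac{1}{\sqrt{n}\lambda_{n,e}}\big|y_i - x_i\hat\beta - \sqrt{n}\hat e_i\big| < 1 \ \text{ for } \hat e_i = 0, \qquad (29)$$

where $X_i$ denotes the $i$-th column of $X$ and $x_i$ its $i$-th row.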
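A candidate pair can be tested against these optimality conditions numerically. The sketch below is an illustration, not part of the paper: it assumes the $\sqrt{n}$-scaled objective from (23), checks the non-strict form of the inequalities (the strict form is what the uniqueness claim uses), and the name `kkt_violation` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def kkt_violation(X, y, beta, e, lam_beta, lam_e, tol=1e-8):
    """Largest violation of the stationarity conditions for beta and for e.

    On the support the scaled gradient must equal the sign; off the support
    its magnitude must not exceed one. Returns (violation_beta, violation_e).
    """
    n = X.shape[0]
    r = y - X @ beta - np.sqrt(n) * e
    g_beta = X.T @ r / (n * lam_beta)    # candidate subgradient for beta
    g_e = r / (np.sqrt(n) * lam_e)       # candidate subgradient for e

    def violation(g, v):
        on = np.abs(v) > tol
        on_gap = np.max(np.abs(g[on] - np.sign(v[on])), initial=0.0)
        off_gap = np.max(np.abs(g[~on]) - 1.0, initial=0.0)
        return float(max(on_gap, off_gap))

    return violation(g_beta, beta), violation(g_e, e)

if __name__ == "__main__":
    # Scalar toy problem: minimize 0.5 * (3 - b - e)^2 + |b| + |e|.
    X, y = np.array([[1.0]]), np.array([3.0])
    print(kkt_violation(X, y, np.array([2.0]), np.array([0.0]), 1.0, 1.0))
    print(kkt_violation(X, y, np.array([0.0]), np.array([0.0]), 1.0, 1.0))
```

In the toy problem the first point is optimal (both violations vanish), while the origin is not, since the scaled gradient there has magnitude 3.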



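We will show that under the assumptions of Theorem 2, the solution pair of the extended Lasso is given by $(\hat\beta, \hat e) = (\beta^* + h, e^* + g)$ where $h_{T^c} = 0$, $g_{S^c} = 0$ and

$$h_T = (X^*_{S^cT}X_{S^cT})^{-1}\big[X^*_{S^cT}w_{S^c} + \sqrt{n}\lambda_{n,e}X^*_{ST}\,\mathrm{sgn}(e^*_S) - n\lambda_{n,\beta}\,\mathrm{sgn}(\beta^*_T)\big], \qquad (30)$$

and

$$g_S = -\frac{1}{\sqrt{n}}X_{ST}(X^*_{S^cT}X_{S^cT})^{-1}\big[X^*_{S^cT}w_{S^c} + \sqrt{n}\lambda_{n,e}X^*_{ST}\,\mathrm{sgn}(e^*_S) - n\lambda_{n,\beta}\,\mathrm{sgn}(\beta^*_T)\big] + \frac{1}{\sqrt{n}}w_S - \lambda_{n,e}\,\mathrm{sgn}(e^*_S). \qquad (31)$$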



The expressions of $h_T$ and $g_S$ in the above equations are obtained by solving the KKT conditions (26) and (27) restricted to $\hat\beta_{T^c} = 0$ and $\hat e_{S^c} = 0$ together with setting $z^{(\beta)}_T = \mathrm{sgn}(\beta^*_T)$ and $z^{(e)}_S = \mathrm{sgn}(e^*_S)$. We note that, due to the conditions on the sample size $n$ and the fraction of errors in Theorem 2, $X^*_{S^cT}X_{S^cT}$ is invertible thanks to the random Gaussian matrix concentration inequalities (see Lemma 14 in Appendix VIII-D). Therefore, the expressions of $h_T$ and $g_S$ are valid.
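The closed-form error vectors can be computed directly and checked against the restricted stationarity conditions. The following sketch assumes the model $y = X\beta^* + \sqrt{n}e^* + w$ and the $\sqrt{n}$-scaled objective; the function name `witness` is hypothetical, and the code only verifies the algebra, not the probabilistic claims.

```python
import numpy as np

def witness(X, w, beta_star, e_star, lam_beta, lam_e):
    """Closed-form (h_T, g_S) solving the KKT system restricted to T and S."""
    n = X.shape[0]
    T, S = np.flatnonzero(beta_star), np.flatnonzero(e_star)
    Sc = np.setdiff1d(np.arange(n), S)
    X_ScT, X_ST = X[np.ix_(Sc, T)], X[np.ix_(S, T)]
    sgn_b, sgn_e = np.sign(beta_star[T]), np.sign(e_star[S])
    rhs = (X_ScT.T @ w[Sc]
           + np.sqrt(n) * lam_e * (X_ST.T @ sgn_e)
           - n * lam_beta * sgn_b)
    h_T = np.linalg.solve(X_ScT.T @ X_ScT, rhs)   # requires n - s >= k
    g_S = (w[S] - X_ST @ h_T) / np.sqrt(n) - lam_e * sgn_e
    return T, S, h_T, g_S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 30, 12
    X = rng.standard_normal((n, p))
    beta_star = np.zeros(p); beta_star[[1, 4, 7]] = [2.0, -3.0, 1.5]
    e_star = np.zeros(n); e_star[:5] = [5.0, -4.0, 6.0, -7.0, 3.0]
    w = 0.1 * rng.standard_normal(n)
    T, S, h_T, g_S = witness(X, w, beta_star, e_star, 0.05, 0.05)
    print(np.abs(h_T).max(), np.abs(g_S).max())
```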

To confirm that $(\hat\beta, \hat e)$ is the optimal solution of the extended Lasso (6), in the following subsections we will check that $\hat\beta$ and $\hat e$ chosen above obey conditions (28) and (29). In particular,

1) In Subsection V-A we show that $\|z^{(\beta)}_{T^c}\|_\infty < 1$.

2) In Subsection V-B we show that $\|z^{(e)}_{S^c}\|_\infty < 1$.

3) In Subsection V-C we establish that $\|h_T\|_\infty \le f_\beta(\lambda_{n,\beta})$. It then follows from the assumptions of Theorem 2 that $\|h_T\|_\infty < \min_{i \in T}|\beta^*_i|$ and, therefore, $\mathrm{supp}(\hat\beta_T) = \mathrm{supp}(\beta^*_T)$ and $\mathrm{sgn}(\hat\beta_T) = \mathrm{sgn}(\beta^*_T)$.

4) In Subsection V-D we establish that $\|g_S\|_\infty \le f_e(\lambda_{n,\beta}, \lambda_{n,e})$. It then follows from the assumptions of Theorem 2 that $\|g_S\|_\infty < \min_{i \in S}|e^*_i|$ and, therefore, $\mathrm{supp}(\hat e_S) = \mathrm{supp}(e^*_S)$ and $\mathrm{sgn}(\hat e_S) = \mathrm{sgn}(e^*_S)$.



A. Verify the upper bound of $\|z^{(\beta)}_{T^c}\|_\infty$



Proof. First, we define a notation which will be used through- 
out the rest of the paper. Let A := By the definition of 



to 



\ n ,i3 and A n , e in (19 1, we have 



A 



7 



1 



We state two supporting lemmas whose proof are deferred 
to the end of this section. 



77 log p 



(32) 



2^ max{p H , At ax} 
where we introduce another shorthand notation p u = 

Pu{^T"\t)- 

From the expression of j3 = 0* + h and e = e* + g with 
h T a = 0, gs" = and Kt, gs defined in ( |30| > and pi) , we 

e) together with 

w — 



Lemma 3. Denote z — -j=\X* ST sgn(eg) — sgn(/3£). Define 
the event 



£, := 



AtaxSlogp 

n 



substitute into zjf c ' = -^—X^ c (y — X/3 
noticing that XX, c Xt — X% Ttl Xgj- = X 
Xg Tc ws = Xg CTC wsc to arrive at 



ScTc Xs<=t, X T 



Then, P(£ z ) > 1 - 2 exp(- logp). 

Lemma 4. For any e € (0, 1), define the event £ — {M < 
M}, where 



nX 



-Xga Tc HS'=TWS<= 



Xgc T c Xs<=t{Xsc T Xs<:t) 



(33) 



M := -X 2 s + 1 + max i e,4 

n \ v n — s 



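Here, we define $\Pi_{S^cT} := I - X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}X^*_{S^cT}$, which is the orthogonal projection onto the orthogonal complement of the column space of $X_{S^cT}$, and $z := \frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)$.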
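The properties of this projector that the proof relies on are easy to confirm numerically. The sketch below builds $I - A(A^*A)^{-1}A^*$ through a QR factorization (a numerically safer route chosen here, not something the paper prescribes) for a generic full-column-rank matrix `A` standing in for $X_{S^cT}$.

```python
import numpy as np

def residual_projector(A):
    """Return I - A (A^T A)^{-1} A^T for a full-column-rank matrix A."""
    Q, _ = np.linalg.qr(A)               # columns of Q span the column space of A
    return np.eye(A.shape[0]) - Q @ Q.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((12, 4))
    P = residual_projector(A)
    print(np.abs(A.T @ P).max())         # numerically zero: P annihilates col(A)
    print(np.linalg.norm(P, 2))          # spectral norm of a projection is one
    print(np.trace(P))                   # rows minus rank, here 12 - 4
```

Because the projector annihilates the columns of `A`, the vector it produces is orthogonal to anything of the form `A @ v`, which is the orthogonality used later when the two terms of the dual vector are separated.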

We can further simplify the expression of zif) by denoting 

' 7s As S n ( e s) ' 



CT 2 (n — s) 



k 1 + A 



V 




(40) 



(n - s)C, 



then we have 



ITscTWS<: — Xs<=T(Xga T Xs<=T)~ 



z 



(34) 



(35) 



defined in 



in which the first is 



Then, ¥(£) > 1 — ci exp(— 02(71 — s)e 2 }) for some universal 
constants C\,C2 > 0. 

Conditioned on the event $\mathcal{E}$ defined in Lemma 4, the probability $P(\max_{i \in T^c}|b_i| > \gamma)$ is upper bounded by

$$P\big(\max_{i \in T^c}|b_i| > \gamma \mid \mathcal{E}\big) + c_1\exp(-c_2(n-s)).$$

We recall that $b_i$ is a zero-mean Gaussian random variable, thus the standard Gaussian tail bound in (61) allows us to derive

Pfmax \bi\ > 7 I £ ) < 2<p - k) cxp ( ^= 

This exponential probability decays at the rate of 
18}. Therefore, z^J consists of two components exp(-clog(p - fc)) provided that \2p u Mlog(p - k) 

is strictly less than one. Now we replace the definition of 
M in (40 1 into this inequality. To do this, we notice that 
— o(l) from the sample size assumption of Theorem 



Conditioning on Xt, the matrix X^ c can be decomposed into 
a linear prediction plus a prediction error 

Xrp c — YjJ^Cj^Yjrp^Xrp ~\~ Erp c ^ (36) 

where each row of the matrix Es^t^ is a 7V(0, E^c| T ) 
Gaussian random vector whose entries are i.i.d and E T c| T is 



and the second is 



En 



(37) 



Since $\Pi_{S^cT}$ is the orthogonal projection onto the orthogonal complement of the space spanned by the columns of the matrix $X_{S^cT}$, we have $X^*_{S^cT}\Pi_{S^cT} = 0$. Thus, $a$ can be simplified as

' ' ' 'Sgn(elj)) - £ T c T £- 



thus we can select e 6 (0,1) such that 4^ 



< e. 

"Following some simple algebra, we find that it is sufficient to 
have 



> 



2p„ 



-^k\og(p-k)x 
1 + e C min 7 



Z^jTcfYjrp^^XXgrp ; 



(38) 



< 



1 + 



At ax S log p 



C min (» - s) A 2 s 
(l + e)A; n 

(n - sf g 2 C min 

« 2 A ',/3 fc 



The mutual incoherence assumption in (17) gives us $\|a\|_\infty \le 1 - \gamma$. All that is left is to establish the $\ell_\infty$-norm bound of the second component: $\|b\|_\infty < \gamma$. Denote $E_i$ as the $i$-th column of the matrix $E_{T^c}$. Conditioned on $X_{S^cT}$, the $i$-th coefficient of the vector $b$, $b_i = \langle E_i, v\rangle$, is a Gaussian random variable with variance $\mathrm{Var}(b_i) = v^*\mathbb{E}[E_iE_i^*]v \le \rho_u\|v\|_2^2$, where $\|v\|_2^2$ is quantified as

M := — 



Replace the expression of A in ( 32 1 and s = lyn and perform 
some simple algebra, we conclude that the l^, -norm of z^i 
is strictly less than one as long as the following bound of the 
sample size obeys 

1 2pu 



n 2 X 2 



\n S c T w S c\\; + z*(x* saT x S a T y 



2(1 + e) > (1 - n) C min7 2 
f 9 



k \og(p — k) 



(39) 



1 / -1 \ 2 Cmin 
which matches the assumption of Theorem 2. □

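Proof of Lemma 3. Recalling the expression of $z$ in the lemma, we have by the triangle inequality $\|z\|_\infty \le \frac{\lambda}{\sqrt{n}}\|X^*_{ST}\,\mathrm{sgn}(e^*_S)\|_\infty + 1$. Furthermore, we know that the matrix $X_{ST}$ can be represented as $W_{ST}\Sigma_{TT}^{1/2}$, where $W_{ST} \in \mathbb{R}^{s \times k}$ is a random matrix with i.i.d. entries of zero mean and unit variance. Hence,

$$\|X^*_{ST}\,\mathrm{sgn}(e^*_S)\|_\infty = \|\Sigma_{TT}^{1/2}W^*_{ST}\,\mathrm{sgn}(e^*_S)\|_\infty \le \|\Sigma_{TT}^{1/2}\|_\infty\,\|W^*_{ST}\,\mathrm{sgn}(e^*_S)\|_\infty,$$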



where the inequality follows from matrix sub-multiplicative 



norm and 



a/2 



< lis 



TT | 



1/2 _ 



Consider the random variable $V_i = \langle w_i, \mathrm{sgn}(e^*_S)\rangle$, where $w_i$ is a column vector of $W_{ST}$. Recall that each entry of $w_i$ is $\mathcal{N}(0,1)$ and $\|\mathrm{sgn}(e^*_S)\|_2 = \sqrt{s}$. Hence, $V_i$ is a Gaussian random variable with variance $s$. Applying the Gaussian tail bound (61) in the Appendix together with the union bound over the $k$ columns yields

$$P\big(\|W^*_{ST}\,\mathrm{sgn}(e^*_S)\|_\infty > \tau\big) \le 2k\exp(-\tau^2/(2s)).$$

Selecting $\tau = 2\sqrt{s\log p}$ gives the bound $2k\exp(-2\log p)$, so the probability decays to zero. Combining these inequalities completes the proof of Lemma 3. □
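The tail bound in this proof can be illustrated with a small Monte Carlo experiment. This is a sketch under the standard Gaussian assumption on $W_{ST}$; the sizes `s`, `k`, `p` and the helper name are arbitrary illustrative choices. With $\tau = 2\sqrt{s\log p}$ the union bound gives a failure probability of at most $2k/p^2$ per draw.

```python
import numpy as np

def max_signed_column_sum(s, k, rng):
    """Draw W (s x k, i.i.d. N(0,1)) and a sign vector; return ||W^T sgn||_inf."""
    W = rng.standard_normal((s, k))
    signs = rng.choice([-1.0, 1.0], size=s)
    return float(np.max(np.abs(W.T @ signs)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s, k, p = 200, 10, 512
    tau = 2.0 * np.sqrt(s * np.log(p))
    draws = [max_signed_column_sum(s, k, rng) for _ in range(200)]
    print(max(draws), tau, 2 * k / p ** 2)
```

Each coordinate of `W.T @ signs` has variance `s`, so the threshold sits about five standard deviations out for these sizes and exceedances should be extremely rare.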

Proof of Lemma |5] Since IIs^t is the orthogonal projec- 

In addition, 



tion matrix, we have lingeries 

|2 2 



< 



12 - \\ W S" 

^2 || || 2 is the X 2_var i ate w i tri i n — s ) degrees of freedom, 
thus 



^ 2 A 



n,/3 



„ ,,2 / s cr 2 (n — s) 

\n s . T w s 4 2 2 > (l + e ) 1 ^ 

"V/9 



< 2exp 



3(n- s)e 2 
16 



Turning to the last term of M, by the spectral norm bound of 
the Gaussian random matrix d66ll, we obtain 



(n - s)C n 



with probability greater than 1 — ciexrjf— C2(n — s)). Con- 



ditioned on the event £ z in Lemma 

2 

|2 ^ i [ -\ i \ /D+„ x slog 



we have 



< 



< A; 1 + A 



by combining these bounds. The proof is completed. □

B. Verify the upper bound of $\|z^{(e)}_{S^c}\|_\infty$



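Proof. By substituting the expressions of $\hat\beta$ and $\hat e$ into $z^{(e)}_{S^c} = \frac{1}{\sqrt{n}\lambda_{n,e}}(y_{S^c} - X_{S^c}\hat\beta)$, we get

$$z^{(e)}_{S^c} = \frac{1}{\sqrt{n}\lambda_{n,e}}\Pi_{S^cT}w_{S^c} - \frac{\sqrt{n}}{\lambda}X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}z, \qquad (41)$$

where we use the same notations $\Pi_{S^cT}$ and $z$ as in the previous section: $\Pi_{S^cT} := I - X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}X^*_{S^cT}$ and $z := \frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)$. To show that $\|z^{(e)}_{S^c}\|_\infty < 1$, we bound the $\ell_\infty$-norm of each term of the sum (41) separately. In particular, we will establish that with probability converging to one, the $\ell_\infty$-norm of the first term is bounded by $\frac{2\sigma\sqrt{\log n}}{\lambda_{n,e}\sqrt{n}}$ and that of the second term is less than $1 - \frac{2\sigma\sqrt{\log n}}{\lambda_{n,e}\sqrt{n}}$. The proof is therefore completed by the triangle inequality.

We begin by establishing the $\ell_\infty$-norm bound of the first term in (41):

$$\frac{1}{\sqrt{n}\lambda_{n,e}}\|\Pi_{S^cT}w_{S^c}\|_\infty = \max_i \frac{1}{\sqrt{n}\lambda_{n,e}}|\langle u_i, w_{S^c}\rangle|,$$

where $u_i$ is a column vector of $\Pi_{S^cT}$. Since $\frac{1}{\sqrt{n}\lambda_{n,e}}\langle u_i, w_{S^c}\rangle$ is a sum of Gaussian random variables with zero mean and variance $\frac{\sigma^2}{n\lambda_{n,e}^2}\|u_i\|_2^2$, it can be bounded by the Gaussian tail inequality (61) in Appendix VIII-D. Notice that the spectral norm of any orthogonal projection is one, so $\|u_i\|_2 \le 1$. We have

$$P\Big(\frac{1}{\sqrt{n}\lambda_{n,e}}|\langle u_i, w_{S^c}\rangle| > \tau\Big) \le 2\exp\Big(-\frac{n\lambda_{n,e}^2\tau^2}{2\sigma^2}\Big).$$

Choosing $\tau = \frac{2\sigma\sqrt{\log n}}{\lambda_{n,e}\sqrt{n}}$ and taking the union bound over all $|S^c|$ columns of the matrix $\Pi_{S^cT}$, we have

$$P\Big(\frac{1}{\sqrt{n}\lambda_{n,e}}\|\Pi_{S^cT}w_{S^c}\|_\infty > \frac{2\sigma\sqrt{\log n}}{\lambda_{n,e}\sqrt{n}}\Big) \le 2|S^c|\exp(-2\log n). \qquad (42)$$

Next, we control the upper bound of $\frac{\sqrt{n}}{\lambda}\|X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}z\|_\infty$. The following lemma, whose proof is deferred to Appendix VIII-A, establishes this bound.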

Lemma 5. Under the assumptions of Theorem [2] for any 
vector z € M. k independent with Xs»t, the following statement 
holds 



T \\X S c T (X ScT X S ct) z|L<-[l- x — ^ 



with probability greater than 1 — C\ exp(— C2 max{log(p — 
fc),log(n-s)}). 

Since $\mathrm{sgn}(\beta^*_T)$ and $X^*_{ST}\,\mathrm{sgn}(e^*_S)$ are statistically independent of $X_{S^cT}$, the vector $z := \frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)$ satisfies the assumption of Lemma 5. Moreover, by Lemma 3 and the definition of $\lambda$ in (32), we have with high probability



<D 



+ 

max 



\\4oo< 1 + M 

v 11 

where the last inequality holds 
of Theorem [2] Now, applying 



slog_p 3 
- 2' 



A 



\Xs<=t{Xs c t-^S"t) 



from the assumption 
Lemma [5] leads to 



< 1 - 



A„,e y/n 



Putting these two bounds together and using the triangle inequality, we conclude that with high probability $\|z^{(e)}_{S^c}\|_\infty < 1$, as claimed. □






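C. Establish the bound of $\|\hat\beta_T - \beta^*_T\|_\infty$

Recalling the formula of $\hat\beta_T - \beta^*_T$ from (30), the triangle inequality yields

$$\|\hat\beta_T - \beta^*_T\|_\infty \le \big\|(X^*_{S^cT}X_{S^cT})^{-1}X^*_{S^cT}w_{S^c}\big\|_\infty + n\lambda_{n,\beta}\Big\|(X^*_{S^cT}X_{S^cT})^{-1}\Big(\frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)\Big)\Big\|_\infty =: T_1 + T_2. \qquad (43)$$

To bound the first quantity, we consider the random vector $u = \big(\frac{1}{n-s}X^*_{S^cT}X_{S^cT}\big)^{-1}\frac{1}{n-s}X^*_{S^cT}w_{S^c}$ and note that $T_1 = \|u\|_\infty$. The following bound has been established in equation (42) of [8]: there exists some numerical constant $c$ such that

$$P\Big(T_1 \ge 20\sqrt{\frac{\sigma^2\log k}{C_{\min}(n-s)}}\Big) \le 4\exp(-c(n-s)). \qquad (44)$$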



Turning now to the second quantity $T_2$, we have

$$T_2 \le n\lambda_{n,\beta}\big\|(X^*_{S^cT}X_{S^cT})^{-1}z\big\|_\infty,$$

where $z := \frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)$. To bound $T_2$, we follow similar arguments to those in [8], Section V.B. We can now state the following lemma, which is modified from Lemma 5 of [8].

Lemma 6. Let z 6 R k be a fixed nonzero vector and W £ 



ix k 



be a random matrix with i.i.d entries Wij ~ A/"(0, 1). 



Then, there exists positive constants C\ and c 2 such that 

-1 



IT 



11 



-I 



k X k 



> Cl 



k \og(p — k) 



11 



< 4exp(— c 2 min{fc, log(p — k)}). 

Following similar arguments as in [8|, Section V.B, we have 
a similar probabilistic bound as equation (41) of JH 



T2 > ClX n .p\ 



' knlog(p — k) 



1-1/2 



YT TT l2 z 



(n — s) 2 

< 4exp(— c 2 min{£;, log(p — k)}). (45) 
Furthermore, Lemma [5] states that H^H^ < 3/2 with high 



probability. Conditioning on the event £ = -Q 



we have 



12 TT /2 z 



< 



-,-1/2 

-*TT 



. Thus, (45 1 leads to 



< 3/2}, 



' kn\og{p — k) 



1-1/2 



£ 



(n — s) 2 

< 4exp(— c 2 min{fc, log(p — k)}). 



T2 > C2\ n .[j\ 



By the total probability rule, P(7 2 > r) < f{T 2 \£ ) - 
Therefore, we conclude that with probability greater than 1 — 

6 exp(— c 2 min{fc, log(p — fc)}), 



k \og{p — k) 



-,-1/2 



(46) 



(1 — r\) 2 n 

Overall, combining the bound of $T_2$ with the bound of $T_1$ in (44), we conclude that $\|\hat\beta_T - \beta^*_T\|_\infty \le f_\beta(\lambda_{n,\beta})$ with probability at least $1 - 10\exp(-c_3\min\{k, \log(p-k)\})$, where $f_\beta(\lambda_{n,\beta})$ is defined in (21).



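D. Establish the bound of $\|\hat e_S - e^*_S\|_\infty$

Recalling the formula of $\hat e_S - e^*_S$ in (31) and applying the triangle inequality, we get

$$\|\hat e_S - e^*_S\|_\infty \le \frac{1}{\sqrt{n}}\big\|X_{ST}(X^*_{S^cT}X_{S^cT})^{-1}X^*_{S^cT}w_{S^c}\big\|_\infty + \lambda_{n,\beta}\sqrt{n}\,\big\|X_{ST}(X^*_{S^cT}X_{S^cT})^{-1}z\big\|_\infty + \frac{1}{\sqrt{n}}\|w_S\|_\infty + \lambda_{n,e} =: T_1 + T_2 + T_3 + \lambda_{n,e}, \qquad (47)$$

where we again denote $z = \frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)$. We first consider the easiest term $T_3 = \frac{1}{\sqrt{n}}\|w_S\|_\infty$. Since $w_S$ is a random vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, by Gaussian extreme order statistics [37], $T_3 \le 2\sqrt{\sigma^2\log s / n}$.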

Turning to the first term 71, we define a vector v € R s 
whose entries are %\ := Xi(X ScT Xgc T )^ 1 X ScT ws'= where 
Xi is the i-th row of the matrix X$t and notice that 71 = 
IMloo- Conditioned on Xt, it is clear that Vi is a zero 
mean random variable with variance a 2 Xi(X ScT X s^t) -1 x* . 

In addition, we recall that Xt can be represented as Xt = 

1/2 

WtS tt where Wt is the n x k standard Gaussian ma- 
trix. Thus, x^X^Xsct^x* = w 1 {W^t w s-t)' 1 w* < 
\\ w i\\ 2 || (^S c t^ / S' c t) _1 ||> where Wi is the i-th row of matrix 
Wst- In short, Vi is a zero mean random variable with vari- 
\ ance at most a 2 := a 2 \\wi\\ 2 2 \\{Wg cT W s <=T)~ 1] ^ Using the 
\ z \\oo I concentration result for x 2 -variate, we get ||u>i|| 2 < 2fc with 



probability at least 1 — cxp ( — k /2). Furt hermore , from random 



VIII-D 



matrix theory ( 65 1 in Appendix 

with probability at least 1 — exp( — (n 



(W^Ws^t)' 1 ] 
s)/2). 



< 



Next, let us define the event 



£ 



a 2 > 



lOcHfc 



From the above arguments, we have P(£) < exp(— (n — s + 
k)/2). By the total probability rule, we have 

P(Ti > t) < P(7i > t\£ c ) + ¥(£). 

Conditioning on £ c , Vi is zero mean Gaussian with variance at 



most 



lOeHfc 



|VIII-D[ we derive 

P max \vi 
\ ies 



Thus, by the Gaussian tail bound (|61b in Appendix 



> t < 2sexp 



(n 



10a 2 k 



Setting T = \J 2Qcr ~~ ~ yields the fact that this probabil- 
ity vanishes at rate 2(p — Overall, we can now conclude 
that 



Pf Ti > n ] J a2k] ^ kl j <2exp(-log(p-fc)). 

It is left to bound 7^. By sub-multiplicative norm inequality, 
T2 is bounded by 

K^llXsTlLUX^XscTrhll , 






We already established n\ n ^ || {Xq cT Xs"t) 1 z\\ ao m '46 1. 
In addition, ||-X"sx|loo — V^||^st|| where by the matrix 

< 4C max (s + 



A. Lower l^-norm bound of z)f, 



X* st X S t\ 



theory ( 66 1 in Appendix VIII-D 

Vsk) with high probability. Thus, H^stII^ < VC- mayi (sk + 
£Vsfc) 1/2 . 

Overall, combining with the bounds of $T_1$ and $T_3$, we conclude that $\|\hat e_S - e^*_S\|_\infty \le f_e(\lambda_{n,\beta}, \lambda_{n,e})$ with probability at least $1 - 10\exp(-c_3\min\{k, \log(p-k)\})$, where $f_e(\lambda_{n,\beta}, \lambda_{n,e})$ is defined as in (22).



Recall the expression of z^> in d33b and its simplified form 



b where b and a are defined in (37 i and (38 1. We 



VI. Proof of Theorem 3 - Inachievability

Our analysis in this section relies on the notion of the primal-dual witness introduced by Wainwright [8]. In particular, we will construct a pair of primal solutions $(\hat\beta, \hat e)$ and their dual vectors $(z^{(\beta)}, z^{(e)})$. The extended Lasso (5) fails to correctly identify the signed supports of the coefficient vector $\beta^*$ and the error $e^*$ when the $\ell_\infty$-norm of either $z^{(\beta)}_{T^c}$ or $z^{(e)}_{S^c}$ exceeds unity with probability approaching one. The primal-dual witness is constructed as follows:

1) First, we obtain the solution pair $(\hat\beta_T, \hat e_S)$ of the following restricted Lasso problem



already have \\a\\ < 1 — 7 due to the mutual incoherence 
assumption. It is now sufficient to show that max, e T<= \bi\ 
exceeds (2 — 7) with high probability. 

Conditioning on $X_T$ and $w$, the vector $b$ is zero-mean Gaussian with covariance matrix $M\Sigma_{T^c|T}$, where the random scaling factor $M$ has the form (39). The following lemma controls the lower bound of this scaling factor. The proof is similar to that of Lemma 6 in [8], so we omit the details here.

Lemma 8. Define the event $\mathcal{E} = \{M > \underline{M}\}$, where $\underline{M}$ is defined in (49). Then, $P(\mathcal{E}) \ge 1 - c_1\exp(-c_2(n-s))$ for some $c_1, c_2 > 0$.

Following the proof of Theorem 4 in 0, we have the 
following lower bound: for all 1/, e, r > 



max I bi 



> 



v /(2-^p / (S TC |T)Mlog(p-fc) - r (50) 



with probability at least 1 — 2 cxp 2 ^ [p ^ . Now, using 
appropriate choices of {r, v, 7}, it suffices to establish the 
bound 



1 

2^ 



\ys - XstPt - Vnes\\ 2 +X n ,fi ||#r||i+A n>e ||e s || 



(48) 



pi(T, T c lT )M\og(p- k) > 



[(2-7) + ' 



(51) 



We also set fix" = and egc = 0. 



2) Second, we select z^ and z ( a' as elements of the 



subgradients 



and || elk, respectively. 



3) Third, we solve for vectors z^} and z^l satisfying the 
KKT conditions in d26l). We then verify whether the 



(2 - v) ■ 

We consider two cases: 

1) If M_ — > +00 or M_ = 6(1), then we can choose r 2 = 
8M_\og(p — k) for some S > 0. For S sufficiently small, we 
conclude from pO] ) that with probability converging to one, 
there exists some constants c > such that 



IP) 



< 1 and 



dual feasibility conditions of both 
|| ego || < 1 are satisfied. 
4) Fourth, we check whether the sign consistency z^ — 
sgn(/3J.) and z^ — sgn(eg) are satisfied. 

The following result summarizes the use of the primal-dual witness construction in providing the proof of Theorem 3.

Lemma 7. If either step 3 or step 4 of the primal-dual construction fails, then the extended Lasso fails to recover the correct signed supports of both $\beta^*$ and $e^*$.

The proof of this lemma is essentially similar to that of Lemma 2(c) in [8], thus we omit the details here.



max|&j| > cy/\og(p - k), 

which exceeds (2 — 7) regardless of the choice of the sample 
size n. 

2) Otherwise, M = o(l). This is satisfied only if k/n = 
o(l) and thus, the second line of the definition of M_ is applied. 
Now, we can select r sufficiently small and have a guarantee 
that —> +00. From the definition of M, one can see that if 

\og(p — k) > 2, we can choose r and v strictly positive 



A-.s 



Pi 

but arbitrarily close to zero such that 



[(2-7)- 



(2-u) 



< 2. Thus, 



(51 1 obeys regardless of the selection of the sample size n. 
Consequently, we assume that 



sgri(/3£) and z ( s e) 



In our proof, we assume that Zrf 
sgn(ej); otherwise, the sign consistency would fails. Under 
these assumptions, it is easy to check that the solution (/3<r, es) 
of the optimization (|48| is expressed in <[30|> and 



A < 



2n 



pis\og(p-k) 



we can derive equations of z^} and z$l as in (133 



3T). Thus, 
) and |4l| . 

In the following two sections, we establish the claim by showing that under the conditions on the sample size $n$ and $s = \eta n$ as in Theorem 3, the $\ell_\infty$-norm of either $z^{(\beta)}_{T^c}$ or $z^{(e)}_{S^c}$ exceeds unity with probability tending to one. It is clear that if the extended Lasso (6) fails to recover the signed support vectors with $s = \eta n$, it also fails to do so with $s > \eta n$, since it is easier to solve the extended Lasso when there are fewer corrupted observations.



Under this assumption, we can lower bound 



(52) 
as follows 



sgn(^) - 

> ||Bgn(#.)|| a 

> Vk — A 



1 



XX* ST sgn(es) 



As 
1 



shown during 

\X* T sgnCeDH^ 



n 
the 

< 



^A||X* T sgn(e*)|| 2 
v n 

X*STsm(es)\\oc- 



(53) 



proof of Lemma [3] that 
piSlog(p - fc) with probability 






M 



n 
X 2 s 



if k/n = 9(1) 
if k/n = o(l). 



(49) 



logp), from the above 



greater than 1 — cxp( A 

upper bound of A, we obtain ^= ||X^ T sgn(eg)|| oo < 
Consequently, we achieve the lower bound with high 
probability 



Ml: 



Furthermore, for (n — s) sufficiently large, we select a e E 
(0, 1/2) such that A^J < e. Now, replace this bound into 
the second equation of M_ and perform some simple algebra, 
we can show that the inequality ( |5Tj ) is satisfied as long as 



Pi k log(p - fc) 

Cmax (n — s) 

1 (n 
+ 4 + - 



C max A s(?i — s) 
(1 - e)kn 

— s) 2 <T 2 C max 

« 2 A 2 j/3 fc 



> 



[(2-7) + r] 2 
(2-i/)(l-e)' 



Replace the lower bound of A in ( 56 1 and s — nn into the 
above inequality, we can conclude that the inequality ( |5T| is 
satisfied as long as 



Pi 



2fclog(p- k) J 3 



C max (2- 7 ) 2 (n-s) 

|(2- 7 )+t] 2 



+ (1 - V) 



2 ^" Cmax 



> 



(2- 7 ) 2 (l-^/2)(l- e )- 

Under the assumptions of Theorem 3, the left-hand side is strictly greater than one. On the other hand, $\tau$, $\nu$ and $\epsilon$ are parameters that can be chosen in $(0, 1/2)$. By selecting these parameters to be positive but arbitrarily close to zero, we can make the right-hand side arbitrarily close to one. Therefore, (51) is satisfied.



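B. Lower $\ell_\infty$-norm bound of $z^{(e)}_{S^c}$

Recalling the equation of $z^{(e)}_{S^c}$ in (41), we have

$$z^{(e)}_{S^c} = \frac{1}{\sqrt{n}\lambda_{n,e}}\Pi_{S^cT}w_{S^c} - \frac{\sqrt{n}}{\lambda}X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}z,$$

where we recall $z = \frac{\lambda}{\sqrt{n}}X^*_{ST}\,\mathrm{sgn}(e^*_S) - \mathrm{sgn}(\beta^*_T)$. First, notice that $\Pi_{S^cT}$ is the orthogonal projection onto the orthogonal complement of the column space of the matrix $X_{S^cT}$. Thus, the two terms in the above summation are orthogonal to each other. Therefore, lower bounding the $\ell_\infty$-norm of $z^{(e)}_{S^c}$ by its $\ell_2$-norm counterpart, we have

$$(n-s)\|z^{(e)}_{S^c}\|_\infty^2 \ge \|z^{(e)}_{S^c}\|_2^2 = \frac{1}{n\lambda_{n,e}^2}\|\Pi_{S^cT}w_{S^c}\|_2^2 + \frac{n}{\lambda^2}\big\|X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}z\big\|_2^2.$$

From this inequality, we have an important observation that both terms in the sum have to be upper bounded by $(n-s)$. Otherwise, $\|z^{(e)}_{S^c}\|_\infty$ is automatically strictly greater than one, regardless of the choice of the sample size $n$. This observation suggests the required lower bounds of $\lambda_{n,e}$ and $\lambda$:

$$\lambda_{n,e} \ge \frac{\|\Pi_{S^cT}w_{S^c}\|_2}{\sqrt{n(n-s)}} \qquad (54)$$

and

$$\lambda \ge \sqrt{\frac{n}{n-s}}\,\big\|X_{S^cT}(X^*_{S^cT}X_{S^cT})^{-1}z\big\|_2.$$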



We now explicitly establish the lower bound of these regu- 
larization parameters. First, since \^-S c tWs" || 2 is the % 2 
variate 



with n — s degrees of freedom, Lemma [T3] in Appendi 

„,,r,r,a^t^ tUnt 1 I I TT _ _„,,„ II 2 >. W „ „\ „,U 



VIII-D suggests to us that \\^S c Ti^sc\\ 2 > \{n — s) with 
probability at least 1 — exp(— (n — s)/16). Consequently, we 
require 

'a" 
2n' 

Furthermore, we observe that with probability converging to 
one 



A,, 



(55) 



\Xs<=T(Xg cT Xs<:T) 



= z* {X* S o T Xs^t) 



= z^-^iw^ws^r 1 ^ 



^1/2. 



> 



-1/2. 



^{W^Wsot)- 1 ) 



~ 2n C ' 



-1 

max . 



4 



where the second identity follows from the decomposition
$X_{S^cT} = W_{S^cT}\Sigma_{TT}^{1/2}$ and the last inequality is due to the
Gaussian random matrix inequality (63) in Appendix VIII-D.
In combination with the lower bound of $\|z\|_2$, we require



A> 



8C max (n - s) ' 



(56) 



Turning to establish the lower bound of 
show that under the assumptions of Theorem [3] this quantity 
is strictly greater than one. By the triangular inequality, 



> 71 — 72 where 71 is quantified as 



\Xs<=T(Xgc T Xs<=TY 



and the other term is 72 := 



|n S c r w S c 



As 



shown at the beginning of Section V-B we have the fol- 



lowing inequality to hold with probability greater than 1 

2exp(— log(n — s)): 

2^2^ 



r 2 < 



s) 



It is now left to justify that under the assumption of Theorem 



71 > 1- 



2y/g2 log 



The remainder of this section is devoted
to establishing this claim. In what follows, we state two
lemmas that are the main ingredients in establishing the lower
bound of $\Gamma_1$. The proofs of these lemmas are deferred
to the Appendix.

Lemma 9. For any vector $z \in \mathbb{R}^k$ independent of $X_{S^cT}$,
we have with probability greater than $1 - \exp(-\log(n-s))$



Xs"t{X* S c T Xsct) i 



1 



< 16 



-Xs c T^TT 



\z\\ 2 y/2k\og{n - s) 



Lemma 10. With probability at least 1 — 4exp(— \ log(n— s)), 

H Y v -i || ^ 2||z|| 2 y/logjn-s) 
ii 01 11 \\oo — i rr> 



Once these two lemmas are established, we can now show 



that under the assumptions of Theorem 3 71 > 1 + y 7 ^L" 



with high probability. By definition, z = -^\X* ST sgn(etj) — 
sgn(/3y), one can see that z is independent from Xs<=t- 
Thus, by Lemmas [9] and [10] and the triangular inequality, 
we have, with probability at least 1 — exp(— log(p — k)) — 
4exp(-|log(n- s)), 

\\Xs<=T(Xg cT Xs'=T)~ 1 z\\ nc> 

1 



> 



> 



I X ScT^TT z I 



16(l + e)^= ||z|| 2 



2>/log(n- s) 



3(n — s)yCn 



v/log(n - s) /256(l + e) 2 fcC n 



(n - s)y/U a 



(n - s)C n 



14 



(57) 



Recall from the pre vious section that we require the upper 
bound of A in (52i. Otherwise, z£) is strictly greater 



than one regardless of the choice of the sample size n. This 
upper bound of A leads to the lower bound of ||z|| 2 in (54i. 



Furthermore, assuming that n — s > c ^, max k for some large 
enough constant c, we achieve 



\x s , T (x* s . T x s . T )-'z\\ x > i(i - c )^Mg. 



Therefore, the requirement 71 > 1 + 2 V /gr [ s equivalent 



to 



2 



(n - s) 2 < 



(1 — e) / 2y / CT 2 logn\ fcnlog(n — s) 



G 



A, 



A 2 C n 



,(e) 



Replace the upper bound of A in ( 52 1 and s = r/n, the above 
inequality, or equivalently, 
the sample size n obeys 



> 1 is satisfied whenever 



n < 



n 



pi 



12 (1 - ?7) 2 C max 
7 2\J a 2 log n 



x 1 + 



A n ,i 



fc log(n — s) \og{p — k) . 



VII. Conclusion 

In this paper, we studied the $\ell_1$-constrained minimization
problem for sparse linear regression when the observations are
grossly corrupted. We proposed the extended Lasso method,
a natural generalization of the Lasso that recovers
both the regression vector and the error vector. Our main
contribution was to establish that this recovery is faithful,
under both parameter estimation and variable selection criteria,
even when the error magnitude is arbitrarily large and
the fraction of corrupted observations is close to one. Specifically, our first
result showed that the $\ell_2$ estimation error is bounded under
an extended restricted eigenvalue (RE)
condition evaluated on the combined matrix $[X \; I]$. Our
next results considered exact signed support recovery for a
class of random Gaussian design matrices. We showed that
sign consistency is possible even when almost all the
observations are significantly corrupted. We also
established lower and upper bounds on the sample size
under which the extended Lasso succeeds or fails in recovering the
supports with high probability. This number of observations
scales in terms of the model dimension $p$, the sparsity index
$k$, and the error fraction $\eta = s/n$. Notably, all of our results
are consistent with those of the standard Lasso in the absence
of sparse error.
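
As an illustration of the kind of program summarized above, the following is a minimal sketch of a proximal gradient (ISTA) solver for an extended-Lasso-type objective. It assumes the unnormalized form $\frac{1}{2}\|y - X\beta - e\|_2^2 + \lambda_\beta\|\beta\|_1 + \lambda_e\|e\|_1$; the exact scaling used in the paper may differ, and the solver, the function names, the toy data, and the values of `lam_beta` and `lam_e` are all illustrative assumptions rather than the authors' procedure.

```python
import math
import random


def soft_threshold(v, t):
    """Proximal operator of t*|.|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def residual(X, y, beta, e):
    """Return r = X beta + e - y."""
    return [sum(x * b for x, b in zip(row, beta)) + ei - yi
            for row, yi, ei in zip(X, y, e)]


def objective(X, y, beta, e, lam_beta, lam_e):
    """Assumed objective: 0.5*||y - X beta - e||^2 + lam_beta*||beta||_1 + lam_e*||e||_1."""
    r = residual(X, y, beta, e)
    return (0.5 * sum(v * v for v in r)
            + lam_beta * sum(abs(b) for b in beta)
            + lam_e * sum(abs(v) for v in e))


def extended_lasso_ista(X, y, lam_beta, lam_e, iters=3000):
    """ISTA on the stacked variable (beta, e) with a fixed step 1/L."""
    n, p = len(X), len(X[0])
    # ||[X I]||^2 = ||X||^2 + 1 <= ||X||_F^2 + 1, so this L is a safe (loose) bound.
    L = sum(v * v for row in X for v in row) + 1.0
    beta, e = [0.0] * p, [0.0] * n
    for _ in range(iters):
        r = residual(X, y, beta, e)
        # Gradient of the smooth part: X^T r with respect to beta, r with respect to e.
        grad_beta = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        beta = [soft_threshold(b - g / L, lam_beta / L)
                for b, g in zip(beta, grad_beta)]
        e = [soft_threshold(ei - ri / L, lam_e / L) for ei, ri in zip(e, r)]
    return beta, e


# Toy data (hypothetical): 3 of 40 observations carry a gross error of size 20.
rng = random.Random(0)
n, p = 40, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_true = [3.0, -2.0, 0.0, 0.0, 0.0]
e_true = [20.0 if i < 3 else 0.0 for i in range(n)]
y = [sum(x * b for x, b in zip(row, beta_true)) + e_true[i] + rng.gauss(0.0, 0.1)
     for i, row in enumerate(X)]

beta_hat, e_hat = extended_lasso_ista(X, y, lam_beta=1.0, lam_e=1.0)
start_obj = objective(X, y, [0.0] * p, [0.0] * n, 1.0, 1.0)
final_obj = objective(X, y, beta_hat, e_hat, 1.0, 1.0)
print("beta_hat:", [round(b, 2) for b in beta_hat])
print("largest |e_hat| entries:", sorted((round(abs(v), 2) for v in e_hat), reverse=True)[:3])
```

The step size uses the Frobenius norm of $X$ as a conservative bound on the Lipschitz constant, which keeps the objective non-increasing at the price of slower convergence than a sharper spectral estimate would give.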

There are a number of extensions and open questions
related to this work. First, our setup can be extended to the
robust group/multivariate Lasso model. This model has been
shown to outperform the conventional Lasso in many practical
applications as well as in theoretical analysis (e.g., [14], [15],
[38], [39]). It would be interesting to obtain upper and
lower bounds on the sample size when a significant fraction of
observations is corrupted in this setting. Another interesting
direction is to consider a more general situation where both
the observations and the data matrix are corrupted or missing.
In a recent paper, Loh and Wainwright [40] established the
consistency of the Lasso with a noisy, corrupted, or missing data
matrix. Whether similar results hold in the more general
setting is an interesting open problem. Lastly, although our
current work focused exclusively on linear regression, it would
be interesting to investigate sparse additive models (e.g.,
[41], [42]) under grossly corrupted observations.

VIII. Appendix 

A. Proof of Lemma 5

Decomposing $X_{S^cT}$ as $X_{S^cT} = W_{S^cT}\Sigma_{TT}^{1/2}$, where
$W_{S^cT} \in \mathbb{R}^{(n-s)\times k}$ is a random matrix with i.i.d. standard
Gaussian entries, we have $X_{S^cT}(X_{S^cT}^* X_{S^cT})^{-1} =
W_{S^cT}(W_{S^cT}^* W_{S^cT})^{-1}\Sigma_{TT}^{-1/2}$. Consider now the compact
singular value decomposition of $W_{S^cT}$:

$$W_{S^cT} = UDV^*, \quad U \in \mathbb{R}^{(n-s)\times k} \text{ and } D, V \in \mathbb{R}^{k\times k}.$$

Since $W_{S^cT}$ is a Gaussian random matrix with i.i.d. entries,
the columns of $U$ are orthonormal vectors selected uniformly at
random. We can regard $U$ as a random matrix distributed
according to the Haar measure. We have

$$X_{S^cT}(X_{S^cT}^* X_{S^cT})^{-1} = U D^{-1} V^* \Sigma_{TT}^{-1/2}.$$



Using the random matrix concentration inequality in (64), we
have with probability at least $1 - e^{-k}$

\ 1/2 



||W S o T || < 1 + 4 



In addition, from (65), we have with high probability

1 



{(WsctWsot^W < 1 + 4 



k 



n — s I n — s 
Combining these pieces together, we conclude that 



Ml = WWscriW^WscT)- 1 

3/2 

/,■ \ 

< 1 + 4 



< 



— si y/n — s ^/n — s' 

assuming that k is sufficiently smaller than (n — s). 
Next, our goal is to bound 

UD^V*Y,~ 1/2 ' 



mzx^UD^V^^z] 



= max I ( [7*,D f y*E 



-1/2 ^ 

:= max\fi(U)\, 

i 

where $f_i$ is the function acting on the random matrix $U$,
$f_i : \mathbb{R}^{|S^c|\times k} \to \mathbb{R}$.

First we show that fi(U) is Lipschitz (with respect 
to the Euclidean norm) with constant at most = 



(i+<Qfc II 

C mi „(n-s) ll^lloo- 
;|S c |x^ we have 

\fi(Ui) - fi(U 2 )\ 



Indeed, for any given pair U\ U 2 6 



Ui - U 2 ,D^V*E 



-1/2 



TT 



ZC: 



< \\Ui-U 2 \\ F 

< \\U1-U 2 \\ F 



\D*V* 



-1/2. 



-1/2 



-'TT 



\ze 



i \\F 



< \\Ui-U; 



1 



< 



2|lF V^ TsVQ 
(l + e)fc 



(n - s)C n 



Ui-U 2 \\ F \\z\\ c 



Since the distribution of $U$ is invariant under the orthogonal
transformation $U \mapsto -U$, $f_i(U)$ is a symmetric random
variable and zero is a median. Hence, by the concentration of
measure with respect to the Haar measure in Lemma 15, we
get



Vi(P)>r)<exp 



T 2 (n — s) 



\f. 



*\\L 



2^2 



Set t 



exp 

2a"^log n 



2A (l _ 2a-yiog n \ 

over all i £ S c , we have 



C m jn(n ~ S) T 

'8(l + e)k\\z\\l) ' 

„ and take the union bound 



UD^V*T,~y 2 z 



- 2 II II 



w , / C min (n- S ) 2 A 2 / 2aVlo^ , 
< (n — S) exp t^ttz ; — ; — I 1 



12(1 + e)nk 



This probability vanishes at rate exp(— clogn) provided that 



(n-s) 2 > 12(1 + e) 1 



2cr^/log n\ nklogn 



C m in A 2 



Replacing the expression of $\lambda$ in (32) and $s = \eta n$, the above
condition is equivalent to

U >C?(l + e) V maX ^«'- D max} 



logn 



(1 - V) 2 Cnin7 2 



where C is a numerical constant smaller than 48. 



B. Proof of Lemma 9

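Recalling the decomposition $X_{S^cT} = W_{S^cT}\Sigma_{TT}^{1/2}$,
we have

$$X_{S^cT}(X_{S^cT}^* X_{S^cT})^{-1} z - \frac{1}{n-s} X_{S^cT}\Sigma_{TT}^{-1} z = \Big[W_{S^cT}(W_{S^cT}^* W_{S^cT})^{-1} - \frac{1}{n-s} W_{S^cT}\Big]\Sigma_{TT}^{-1/2} z.$$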



Notice that $W_{S^cT}$ is an $(n-s)\times k$ matrix with independent
Gaussian entries with zero mean and unit variance. Consider
now the reduced singular value decomposition of $W_{S^cT}$:

$$W_{S^cT} = UDV^*, \quad U \in \mathbb{R}^{(n-s)\times k} \text{ and } D, V \in \mathbb{R}^{k\times k}.$$

Then the columns of $U$ are $k$ orthonormal vectors selected
uniformly at random. We can think of $U$ as a random matrix
distributed according to the Haar measure. The above equation is now
formulated as

$$\frac{1}{n-s}\, U D \Big[\Big(\frac{D^*D}{n-s}\Big)^{-1} - I\Big] V^*\Sigma_{TT}^{-1/2} z =: U\tilde{D}V^*\Sigma_{TT}^{-1/2} z.$$



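It is clear that

$$\|\tilde{D}\| \le \frac{1}{n-s}\,\|W_{S^cT}\|\,\Big\|\Big(\frac{W_{S^cT}^* W_{S^cT}}{n-s}\Big)^{-1} - I\Big\|.$$

Recalling the random matrix concentration bounds (64) and
(65), we have $\|(W_{S^cT}^* W_{S^cT}/(n-s))^{-1} - I\| \le 4\sqrt{k/(n-s)}$ and
$\|W_{S^cT}\|/\sqrt{n-s} \le (1 + 4\sqrt{k/(n-s)})^{1/2}$. Therefore,

$$\|\tilde{D}\| \le \frac{4\sqrt{k}}{n-s}\Big(1 + 4\sqrt{\frac{k}{n-s}}\Big)^{1/2} \le (1+\epsilon)\frac{4\sqrt{k}}{n-s},$$

where we choose $\epsilon \ge 4\sqrt{k/(n-s)}$.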

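Our goal now is to establish an upper bound of
$\|U\tilde{D}V^*\Sigma_{TT}^{-1/2} z\|_\infty$, which can be rewritten as

$$\max_i |e_i^* U\tilde{D}V^*\Sigma_{TT}^{-1/2} z| = \max_i |\langle U, \tilde{D}V^*\Sigma_{TT}^{-1/2} z e_i^* \rangle| := \max_i |f_i(U)|,$$

where $f_i$ is a function operating on the random matrix $U$,
$f_i : \mathbb{R}^{(n-s)\times k} \to \mathbb{R}$.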

First we show that f%(U) is Lipschitz (with respect 
to the Euclidean norm) with constant at most = 
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^n-s k VCmte Indeed ' f ° r ^ 8 iven P ail " 1/1 ^2 G 

jg>|S c |x^ we jj aye 



-1/2. 



|/*(C7i) - /,(Z7 2 )| = I {Ui - U 2 ,DV*E~^ z, 

< \\Ui-U 2 \\ F 

< \\Ui-U 2 \\ F 



DV*E T r /2 ze* 



DV* 



V _ II * II 



2Nf 



4(l + e)7fc 1 



At the second step, we need to lower bound
$\mathbb{E}\max_i f_i(W_{S^cT})$. This can be estimated via the
Sudakov-Fernique inequality [37]. We have,



E(/i(l*W) - fiiWscr)) 2 = 2z*^ T z 



Consequently, if we denote gi, 1 < i < (n — s) as a sequence 

— 1/2 

E TT ' 2 ) Gaussian random variables, then we have 
established a lower bound 



ofjV(0, 



Since the distribution of $U$ is invariant under the orthogonal
transformation $U \mapsto -U$, $f_i(U)$ is a symmetric random
variable and zero is a median. Hence, by the concentration of
measure with respect to the Haar measure (Lemma 15), we
get



'(/i(E0>T)<exp 




C min (n - s) 3 r 2 
128(1 ■ 



exp 



2 k\\z\\ 



c it! 2 256(l+e) 2 ||z||,fclog(n-s) , . , . t . 

Setting t := — - — c (n- s )3 and ta kmg the union 

bound over all i € S c , we have 



E(/i(W S c T ) - /j(WW)) 2 > E(ft - 3j ) 2 

Therefore, the Sudakov-Fernique inequality [37] suggests that
the maximum over $f_i(W_{S^cT})$ dominates the maximum over $g_i$. In
particular, we have $\mathbb{E}\max_i f_i(W_{S^cT}) \ge \mathbb{E}\max_i g_i$. Moreover,
since $\{g_i\}$ are i.i.d. random variables, by the standard bound
for Gaussian extremes, for all $\delta > 0$, we have



E max /(Ws^t) > Emaxj. 

i 

> 



V(2-<y)log(n-s). 
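
To give a feel for the Gaussian extreme-value bound invoked here, the following is a small Monte Carlo sketch for unit-variance variables. It estimates $\mathbb{E}\max_i g_i$ over $m$ i.i.d. standard Gaussians and compares it with $\sqrt{2\log m}$; the choice $m = 1000$ and the trial count are illustrative assumptions. For moderate $m$ the simulated mean sits noticeably below $\sqrt{2\log m}$, so a lower bound of the form $\sqrt{(2-\delta)\log m}$ with small $\delta$ is best read as a large-$m$ statement.

```python
import math
import random


def expected_gaussian_max(m, trials=300, seed=0):
    """Monte Carlo estimate of E[max of m i.i.d. N(0, 1) variables]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(m))
    return total / trials


m = 1000  # plays the role of (n - s); an illustrative value
estimate = expected_gaussian_max(m)
upper = math.sqrt(2.0 * math.log(m))   # classical upper bound on the expected maximum
lower = math.sqrt(math.log(m))         # a deliberately loose lower reference
print(f"E max ~ {estimate:.3f}, sqrt(2 log m) = {upper:.3f}, sqrt(log m) = {lower:.3f}")
```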



Substituting this expectation bound into (59) yields



UD^V*T, 



-1/2. 



> 



16(1 



\z\\ 2 ^k\og(n 



\/C min (n - s) 3 

< exp(-log(n- s)), 



i 

2 



V^og(ri - s) 



> 



^.Jrj-irj-i Z 



^\og(n - s) 



as claimed. 



C. Proof of Lemma 10 



We have $X_{S^cT} = W_{S^cT}\Sigma_{TT}^{1/2}$, where $W_{S^cT}$ is a standard
Gaussian matrix of size $(n-s)\times k$. Thus, $X_{S^cT}\Sigma_{TT}^{-1} =
W_{S^cT}\Sigma_{TT}^{-1/2}$, which leads to

$$\|X_{S^cT}\Sigma_{TT}^{-1} z\|_\infty = \max_i |\langle e_i, W_{S^cT}\Sigma_{TT}^{-1/2} z \rangle| =: \max_i |f_i(W_{S^cT})|,$$

where $e_i \in \mathbb{R}^{n-s}$ is the standard basis vector whose $i$-th
entry is one and whose other entries are zero. To
lower bound the random variable $\max_i f_i(W_{S^cT})$, the
first step is to show that it is sharply concentrated around its
expectation.

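Lemma 11. For any $\tau > 0$, we have

$$\mathbb{P}\Big(\Big|\max_i f_i(W_{S^cT}) - \mathbb{E}\max_i f_i(W_{S^cT})\Big| \ge \tau\Big) \le 4\exp\Big(-\frac{\tau^2}{2\|\Sigma_{TT}^{-1/2} z\|_2^2}\Big). \qquad (58)$$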

for 5 arbitrarily close to zero. Furthermore, using the standard 



bound 
proof. 



v-y 2 z 



> 

2 ~ lis 



1/2 
TT 



> 



we complete the 



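Proof of Lemma 11. We use a standard Gaussian concentration
theorem [37]: let $w$ be distributed according to the standard Gaussian measure on $\mathbb{R}^n$
and let $f$ be a Lipschitz function with Lipschitz constant $\|f\|_{Lip}$.
Then,

$$\mathbb{P}\big(|f(w) - \mathbb{E}f(w)| \ge \tau\big) \le 4\exp\big(-\tau^2 / 2\|f\|_{Lip}^2\big). \qquad (60)$$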



We now consider the function $f(W_{S^cT}) := \max_i f_i(W_{S^cT})$
operating on the standard Gaussian matrix $W_{S^cT}$. We have



/(Wi. T ) - f(W§ aT ) = max (a, Wlo T ^T 2 ' 



< max le l , (Wg T — Wg T )H T ?r 2 Z 



max ( efc, Wgc T Y> T y 2 z 



< 



\w ST -wl T \\ F 



Select t 



-1/2. 



I log(n 



s), we conclude that with 



probability greater than 1 — 4exp(— j log(n — s)) 

nMK/i(WW) > Emax/i^scx) - r. (59) 



where the second inequality follows from the Cauchy-Schwarz
inequality. Applying (60) with Lipschitz constant
$\|\Sigma_{TT}^{-1/2} z\|_2$ completes our proof. □
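
The Lipschitz property used in this proof can be checked numerically. The sketch below draws random matrix pairs and verifies that $|f(W^1) - f(W^2)| \le \|W^1 - W^2\|_F \, \|a\|_2$ for $f(W) = \max_i \langle e_i, Wa \rangle$, where the vector `a` is a hypothetical stand-in for $\Sigma_{TT}^{-1/2} z$; the dimensions and the number of pairs are illustrative assumptions.

```python
import math
import random


def f_max(W, a):
    """f(W) = max_i <e_i, W a>, i.e. the largest entry of the vector W a."""
    return max(sum(w * x for w, x in zip(row, a)) for row in W)


def frobenius_distance(W1, W2):
    """Frobenius norm of W1 - W2."""
    return math.sqrt(sum((u - v) ** 2
                         for r1, r2 in zip(W1, W2)
                         for u, v in zip(r1, r2)))


def worst_lipschitz_ratio(rows=20, cols=4, pairs=500, seed=0):
    """Largest observed |f(W1) - f(W2)| / (||W1 - W2||_F * ||a||_2) over random pairs."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(cols)]  # stand-in for Sigma_TT^{-1/2} z
    a_norm = math.sqrt(sum(x * x for x in a))
    worst = 0.0
    for _ in range(pairs):
        W1 = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        W2 = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        ratio = abs(f_max(W1, a) - f_max(W2, a)) / (frobenius_distance(W1, W2) * a_norm)
        worst = max(worst, ratio)
    return worst


worst_ratio = worst_lipschitz_ratio()
print(f"worst observed ratio = {worst_ratio:.3f} (should not exceed 1)")
```

A ratio at most one is consistent with the Lipschitz constant $\|a\|_2$; random pairs typically give ratios well below one, since the bound is attained only for specially aligned perturbations.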






D. Some concentration inequalities 

In this section, we restate some well-known large deviation
bounds for ease of reference. The first is a bound on a sum of
Gaussian random variables.

Lemma 12. Let $Z_1, \ldots, Z_n$ be independent zero-mean
Gaussian random variables with variances $\sigma_1^2, \ldots, \sigma_n^2$. Then

$$\mathbb{P}\Big(\Big|\sum_{i=1}^n Z_i\Big| \ge t\Big) \le 2\exp\Big(-\frac{t^2}{2\sum_{i=1}^n \sigma_i^2}\Big).$$

This bound comes directly from a standard Gaussian bound.
For a Gaussian variable $Z \sim \mathcal{N}(0, \sigma^2)$, we have for all $t > 0$,

$$\mathbb{P}(|Z| \ge t) \le 2\exp\Big(-\frac{t^2}{2\sigma^2}\Big). \qquad (61)$$
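
As a quick sanity check of the Gaussian tail bound, the following sketch simulates a sum of independent zero-mean Gaussians and compares the empirical tail probability with $2\exp(-t^2 / (2\sum_i \sigma_i^2))$. The standard deviations, the threshold `t`, and the trial count are illustrative assumptions.

```python
import math
import random


def gaussian_sum_tail_check(sigmas, t, trials=100_000, seed=0):
    """Empirical P(|sum_i Z_i| >= t) versus the bound 2*exp(-t^2 / (2*sum sigma_i^2))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(rng.gauss(0.0, s) for s in sigmas)
        if abs(total) >= t:
            hits += 1
    variance = sum(s * s for s in sigmas)
    bound = 2.0 * math.exp(-t * t / (2.0 * variance))
    return hits / trials, bound


empirical, bound = gaussian_sum_tail_check(sigmas=[1.0, 2.0, 0.5], t=5.0)
print(f"empirical tail = {empirical:.4f}, bound = {bound:.4f}")
```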



The following tail bounds on chi-square variates, taken
from [43], are useful.

Lemma 13. Let $X$ be a central $\chi^2$-variate with $d$ degrees
of freedom. Then for all $\tau \in (0, 1/2)$, we have

$$\mathbb{P}(X \ge d(1+\tau)) \le \exp\Big(-\frac{3}{16} d\tau^2\Big),$$

$$\mathbb{P}(X \le d(1-\tau)) \le \exp\Big(-\frac{1}{4} d\tau^2\Big).$$
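
The chi-square tail bounds can be probed by simulation. The sketch below assumes the commonly quoted constants, $\exp(-3d\tau^2/16)$ for the upper tail and $\exp(-d\tau^2/4)$ for the lower tail (the latter matches the $\exp(-(n-s)/16)$ probability used in the main text with $\tau = 1/2$); the values of `d`, `tau`, and the trial count are illustrative.

```python
import math
import random


def chi_square_tail_check(d=50, tau=0.4, trials=20_000, seed=0):
    """Empirical chi-square tails versus exp(-3*d*tau^2/16) and exp(-d*tau^2/4)."""
    rng = random.Random(seed)
    upper_hits = 0
    lower_hits = 0
    for _ in range(trials):
        # A chi-square variate with d degrees of freedom: sum of d squared N(0, 1) draws.
        x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if x >= d * (1.0 + tau):
            upper_hits += 1
        if x <= d * (1.0 - tau):
            lower_hits += 1
    return {
        "upper_empirical": upper_hits / trials,
        "upper_bound": math.exp(-3.0 * d * tau * tau / 16.0),
        "lower_empirical": lower_hits / trials,
        "lower_bound": math.exp(-d * tau * tau / 4.0),
    }


tails = chi_square_tail_check()
for name, value in tails.items():
    print(f"{name}: {value:.4f}")
```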



We also recall some well-known concentration inequalities
from random matrix theory.

Lemma 14. Let $X \in \mathbb{R}^{n\times k}$ be a random matrix whose entries
are standard Gaussian random variables. Denote by $\sigma_{\min}$ and
$\sigma_{\max}$ the smallest and largest singular values of $X$. Then we
have

$$\mathbb{P}\Big(1 - \sigma_{\min}(X)/\sqrt{n} \ge \sqrt{k/n} + \tau\Big) \le \exp(-n\tau^2/2),$$

$$\mathbb{P}\Big(\sigma_{\max}(X)/\sqrt{n} - 1 \ge \sqrt{k/n} + \tau\Big) \le \exp(-n\tau^2/2).$$

By setting $\tau = \sqrt{k/n}$, we conclude that with probability at
least $1 - \exp(-k/2)$,

$$(1 - 2\sqrt{k/n})^2 \le \sigma_{\min}(X^*X/n) \le \sigma_{\max}(X^*X/n) \le (1 + 2\sqrt{k/n})^2. \qquad (62)$$
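
The spectral bound (62) can be checked empirically in a small case. The sketch below assumes the reading $(1 - 2\sqrt{k/n})^2 \le \sigma_{\min}(X^*X/n) \le \sigma_{\max}(X^*X/n) \le (1 + 2\sqrt{k/n})^2$ and restricts to $k = 2$, so that the eigenvalues of the $2 \times 2$ Gram matrix have a closed form and no linear algebra library is needed; `n` and the number of trials are illustrative assumptions.

```python
import math
import random


def extreme_eigs_2x2(a, b, c):
    """Eigenvalues (min, max) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return mean - radius, mean + radius


def gram_eigen_check(n=400, trials=200, seed=0):
    """Fraction of trials in which the eigenvalues of X^T X / n fall inside the bound."""
    k = 2
    rng = random.Random(seed)
    lo = (1.0 - 2.0 * math.sqrt(k / n)) ** 2
    hi = (1.0 + 2.0 * math.sqrt(k / n)) ** 2
    inside = 0
    for _ in range(trials):
        a = b = c = 0.0
        for _ in range(n):
            x1 = rng.gauss(0.0, 1.0)
            x2 = rng.gauss(0.0, 1.0)
            a += x1 * x1
            b += x1 * x2
            c += x2 * x2
        lam_min, lam_max = extreme_eigs_2x2(a / n, b / n, c / n)
        if lo <= lam_min and lam_max <= hi:
            inside += 1
    guaranteed = 1.0 - math.exp(-k / 2.0)
    return inside / trials, guaranteed


fraction_inside, guaranteed = gram_eigen_check()
print(f"fraction inside bounds = {fraction_inside:.3f}, guaranteed >= {guaranteed:.3f}")
```

In this regime the observed success rate is typically far above the guaranteed probability, which reflects how conservative the stated constant is for small $k$.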



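A consequence of this bound is another singular value
bound for the inverse of $X^*X$. We have with probability
greater than $1 - \exp(-k/2)$,

$$\frac{1}{(1 + 2\sqrt{k/n})^2} \le \sigma_{\min}\big((X^*X/n)^{-1}\big) \le \sigma_{\max}\big((X^*X/n)^{-1}\big) \le \frac{1}{(1 - 2\sqrt{k/n})^2}. \qquad (63)$$

From the above two sets of inequalities and the assumption that
$k < n$, we conclude that with probability greater than
$1 - \exp(-k/2)$,

$$\Big\|\frac{X^*X}{n} - I\Big\| \le 4\sqrt{\frac{k}{n}}, \qquad (64)$$

$$\Big\|\Big(\frac{X^*X}{n}\Big)^{-1} - I\Big\| \le 4\sqrt{\frac{k}{n}}. \qquad (65)$$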



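For random matrices whose rows are i.i.d. with distribution
$\mathcal{N}(0, \Sigma)$, we can achieve a similar spectral norm bound.
We have with probability at least $1 - \exp(-k/2)$,

$$\Big\|\frac{X^*X}{n} - \Sigma\Big\| \le 4\sigma_{\max}(\Sigma)\sqrt{\frac{k}{n}}, \qquad (66)$$

$$\Big\|\Big(\frac{X^*X}{n}\Big)^{-1} - \Sigma^{-1}\Big\| \le \frac{4}{\sigma_{\min}(\Sigma)}\sqrt{\frac{k}{n}}. \qquad (67)$$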



Finally, the following lemma states a useful concentration
inequality on the Haar measure [44].

Lemma 15. Suppose $k < n$ and let $f : \mathbb{R}^{n\times k} \to \mathbb{R}$ be a function with
Lipschitz norm

$$\|f\|_L = \sup_{X \ne Y} \frac{|f(X) - f(Y)|}{\|X - Y\|_F}.$$

Then if U is distributed according to the Haar measure, 



F{f{U) > median(f) + r) < exp(- 
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